A new paper published by David Pujol and colleagues investigates the impact of differential privacy mechanisms on equitable decision making and resource allocation.
Building on the prior work of scholars like Dwork & Mulligan and Bagdasaryan et al., Pujol et al. simulate the use of several privacy algorithms on public data sets released by the U.S. Census Bureau to quantify the potentially unfair impacts of those algorithms on certain groups and communities.
Their findings indicate that reliance on data processed via ε-differentially private mechanisms (which inject statistical noise to achieve formal guarantees of privacy) may misrepresent certain groups and distort vital decisions informed by those data.
One of many recent papers on the efficacy and impact of privacy algorithms, the piece speaks to the ethically complex tradeoff between data privacy and data accuracy, and to the accuracy disparities that tradeoff can create in societal decision-making contexts.
In this article, I aim to break down some of these findings and why they matter in the overall conversation around algorithmic transparency.
What is Differential Privacy?
Differential privacy (DP) is a mathematical property of a randomized algorithm: it bounds the ratio of output probabilities induced by changes to any individual’s data, so that nearly indistinguishable outputs are produced for any pair of possible data sets that differ in a single case (i.e., “neighboring” data sets). Because of this, DP offers each case in a given data set a mathematical guarantee that any output computed from a potentially sensitive input is almost as likely as it would be if that case were not in the data set at all.
In other words, employing DP ensures that adding or removing any single case from a data set does not considerably change the probability of any given output — thus offering a formal model of privacy.
In ε-DP, the tuning of a privacy parameter (denoted “ε” and pronounced “epsilon”) determines the strength of the privacy guarantee. The lower the value of ε, the greater the indistinguishability of results (i.e., the greater the privacy protection).
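For readers who want the formal statement (this is the standard definition, not anything specific to Pujol et al.), a randomized mechanism M satisfies ε-DP if, for every pair of neighboring data sets D and D′ and every set of possible outputs S:

$$
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
$$

As ε shrinks toward 0, e^ε approaches 1 and the two output distributions become nearly identical, which is exactly the indistinguishability described above.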
Given the appeal of their stringent privacy guarantees, DP methodologies have been adopted by major holders of personal data (Apple, Google, Microsoft) as well as by public statistical agencies like the U.S. Census Bureau, which integrated ε-DP into the release of 2020 U.S. Census data.
Why Does This Matter?
Public statistics (such as those released by the U.S. Census) guide societal decisions, including the amount of funds allocated to school districts, the number of seats apportioned to regions in legislative bodies, and the languages in which election materials must be available across electoral jurisdictions. Given this, when decisions about resource allocation are made using ε-differentially private data sets, the statistical noise injected to achieve formal guarantees of data privacy can disproportionately affect groups (primarily minoritized groups) whose fair representation depends on data accuracy.
What Does That Mean?
When we use algorithms (like those that satisfy the definition of DP) to achieve formal guarantees of privacy, error is introduced into the data that can distort data-informed decisions. If we’re willing to tolerate some privacy loss, we can have high assurances of data accuracy. By contrast, when we’re willing to tolerate some loss of data accuracy, we can have high assurances of data privacy.
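To make that tradeoff concrete, here is a minimal Python sketch of the classic Laplace mechanism for a single counting query (a generic textbook mechanism, not the Census Bureau’s or Pujol et al.’s implementation; the population count and ε values are invented for illustration). The scale of the injected noise is 1/ε, so stronger privacy means noisier counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count, epsilon):
    """Release a count under epsilon-DP via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 1_200  # hypothetical population count

for eps in [1.0, 0.1, 0.01]:
    noisy = laplace_count(true_count, eps)
    print(f"epsilon={eps}: noisy count ≈ {noisy:,.0f} "
          f"(absolute error ≈ {abs(noisy - true_count):,.0f})")
```

Running this a few times makes the pattern obvious: at ε=1.0 the count is off by only a handful of people, while at ε=0.01 it can be off by hundreds.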
While it’s known that use of privacy-preserving methodologies tends to come at the expense of outcome accuracy, a key question posed by Pujol et al. asks, “if we accept that privacy protection will require some degree of error in decision making, does that error impact groups or individuals equally?”
In other words, will groups affected by the error inherent to privacy-preserving methodologies bear the burdens of data inaccuracy in the same ways? Or will some groups be put at a disproportionate disadvantage?
“If we accept that privacy protection will require some degree of error in decision making, does that error impact groups or individuals equally?” – Pujol et al.
Tell Me More…
In examining assignment problems informed by differentially private versions of public data, Pujol et al. found that different groups can experience disparities in accuracy resulting from the use of formal privacy mechanisms. These accuracy disparities appeared in their simulated applications as unequal error rates in estimated counts, bias in estimated counts, and unequal outcomes when compared to thresholds identified in the original input data.
In the case of minority language benefit classification for voting access, noise injection can be inadvertently biased in one direction, so some counties experience lower rates of correct classification than others. This can result in disparities in access to election materials in non-English languages (even in jurisdictions that should qualify under the Voting Rights Act).
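Here is a minimal sketch of that failure mode, assuming a simplified qualification rule, invented county counts, and a generic Laplace mechanism (not the actual Voting Rights Act formula or the Census Bureau’s production algorithm). Two counties with the same true share of minority-language speakers face very different odds of misclassification, because the injected noise is large relative to the small county’s counts:

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon = 0.1      # illustrative privacy parameter
threshold = 0.05   # stand-in rule: qualify if more than 5% of the population
                   # belongs to the language minority group (simplified)

# Two hypothetical counties with the SAME true qualifying share (5.5%)
counties = {
    "Large county": (550_000, 10_000_000),  # (minority-language speakers, total)
    "Small county": (275, 5_000),
}

trials = 10_000
for name, (minority, total) in counties.items():
    truly_qualifies = minority / total > threshold
    flips = 0
    for _ in range(trials):
        noisy_minority = minority + rng.laplace(scale=1.0 / epsilon)
        noisy_total = total + rng.laplace(scale=1.0 / epsilon)
        flips += (noisy_minority / noisy_total > threshold) != truly_qualifies
    print(f"{name}: misclassified in {100 * flips / trials:.1f}% of trials")
```

In this toy setup the large county is essentially never misclassified, while the small county is denied its (correct) qualification in a nontrivial fraction of trials, despite having an identical true share.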
In the case of funding allocation for school districts, adopting a strict privacy parameter (ε=10^-3) can result in some districts receiving over 500x their proportional share of funds while others receive less than half of theirs.
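The sketch below is, again, purely illustrative: it uses invented district enrollments and a generic Laplace mechanism rather than the specific allocation procedure analyzed in the paper. It shows why proportional allocation computed from very noisy counts can wildly over- or under-fund the smallest districts:

```python
import numpy as np

rng = np.random.default_rng(2)
epsilon = 1e-3      # very strict privacy, in the spirit of the paper's extreme setting
budget = 1_000_000  # hypothetical total funds to distribute

# Hypothetical school-district enrollments (one tiny district among larger ones)
true_counts = np.array([50_000, 20_000, 8_000, 40])

noisy_counts = true_counts + rng.laplace(scale=1.0 / epsilon, size=true_counts.size)
noisy_counts = np.clip(noisy_counts, 0, None)  # enrollments can't be negative

true_share = budget * true_counts / true_counts.sum()
noisy_share = budget * noisy_counts / noisy_counts.sum()

for i, (t, n) in enumerate(zip(true_share, noisy_share)):
    print(f"District {i}: fair share ${t:,.0f}, noisy share ${n:,.0f} "
          f"({n / t:.1f}x its fair share)")
```

Because the noise scale (1/ε = 1,000) dwarfs the tiny district’s true enrollment of 40, its allocation ends up determined mostly by the noise draw rather than by its students.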
And, of course, as reported by the National Congress of American Indians, noise injection can degrade the accuracy and quality of statistics about small populations living in remote areas (including tribal nations spanning multiple U.S. state boundaries), if not erase them from the data entirely.
Final Thoughts
Maintenance of data privacy is a fundamental component of data justice work, scholarship, and advocacy. However, it is also imperative to mitigate and address accuracy disparities resulting from use of privacy-preserving methodologies. Beyond commitment to meaningful data privacy, data justice work embraces a commitment to fairness in the ways individuals and communities are made visible through data. To complement the longstanding focus on aggregate error metrics, privacy algorithm designers must evaluate and communicate the fairness of outcomes to achieve meaningful algorithmic transparency.