How Statisticians Should Grapple with Privacy in a Changing Data Landscape

Suppose you had a data set that contained records of individuals, including demographics such as their age, sex, and race. Suppose also that these data contained additional in-depth personal information, such as financial records, health status, or political opinions. Finally, suppose that you wanted to glean relevant insights from these data using machine learning, causal inference, or survey sampling adjustments.

What methods would you use? What best practices would you ensure you followed? Where would you seek information to help guide you in this process?

Now, consider a different hypothetical scenario. Suppose you had a data set that contained records of individuals, including demographics such as their age, sex, and race. Suppose also that these data contained in-depth personal information, such as financial records, health status, or political opinions. Finally, suppose you believed that other people should have access to this data set so they could glean their own insights from it.

How would you share it? What information would you preserve in the data? What information would you consider too sensitive to share? Where would you seek information about best practices for doing so?

A significantly higher percentage of readers probably will have answers to the questions posed in the first hypothetical scenario than to those in the second, which raises the question of why. Statisticians often use public microdata or tables, or access sensitive data through restricted data centers or agreements. Yet few develop and implement the data privacy and confidentiality methods that enable that access.

While statistical disclosure control (SDC), or limitation, has existed as a subfield of statistics since the mid-20th century, most statisticians never learn how to use SDC techniques, which aim to alter data while preserving statistical utility. Now, with the rapid growth in data collection and sharing, more data practitioners and researchers require statistical data privacy tools and resources to handle the changing data landscape.

Over the past two decades, this growth in the amount of data collected by researchers and administrators has stemmed from the increasing ease of fielding surveys, running experiments, and harvesting massive amounts of observational data from the internet. The scientific community has also increased its demands for transparency; for example, some funding agencies now require, or are considering requiring, data-sharing plans for grantees. A lack of access to underlying data can even lead to embarrassing retractions of published articles, as recently occurred with papers about COVID-19 treatments published in The Lancet and the New England Journal of Medicine.

As these trends continue, more statisticians need to understand and use data privacy techniques to manage their data and share them safely. One may respond that the responsibility for making data private and safe to distribute falls on the shoulders of privacy experts, but there are so few of these experts that they cannot handle the ever-growing quantity of data that researchers collect and desire to share.

While more-complex and larger data products will need expert attention, all statisticians and data scientists should become better acquainted with general data privacy tools and expect to incorporate them into the research process. Just as statisticians and data scientists of many types routinely use the tools of causal inference or machine learning, we should make data privacy part of the general research framework and statistical toolkit.

The privacy landscape is also undergoing significant changes, mostly due to the rapid transformation in the data and computational landscape. To understand privacy tools, statisticians should familiarize themselves with both the more-traditional SDC methods and newer approaches. The conversation about data privacy has recently focused on a newer privacy standard known as differential privacy (DP), proposed by Dwork et al. (2006). This definition and the corresponding methods to meet it are still considered part of the broader SDC framework, but DP is frequently juxtaposed against older SDC methods due to significant differences between them.

While most data maintainers (or data curators) in academia, government, and industry still extensively use both traditional SDC methods and access restrictions such as secure data centers to protect privacy, DP has garnered much attention. Many researchers and data maintainers are moving to develop and implement methods that satisfy DP, but far from being universally accepted, DP has provoked intense debate among statisticians, computer scientists, and social data researchers, among others. This debate has centered on the trade-off between protecting privacy and preserving the usefulness of the data.

DP offers a promising framework, but it remains one of multiple tools in the privacy toolbox. As shown in Figure 1, DP mechanisms replace traditional SDC methods for releasing noisy statistics or synthetic data sets based on the data. DP formally offers a solution to only one side of the problem: the rigorous definition of privacy loss. On the other hand, the current DP methodology has drawbacks when it comes to preserving the utility of the data or carrying out valid statistical inference. These matters need much more attention from the privacy community to make DP viable and practical.

Figure 1. Diagram of SDC and DP frameworks.

Beyond that, statistical methods of disclosure control have never been sufficient for the entire privacy landscape. Secure data centers, data use agreements, and legal protections remain vital elements and will continue to be so, but when it comes to producing public-use data or statistics, the DP framework will, at least loosely, define the future of statistical data privacy.

With this in mind, it is useful to discuss the current issues in the debate over DP and differences between DP and traditional SDC approaches to highlight challenging areas to tackle for statisticians looking to enter the privacy research realm, or those who are already part of it. Overcoming these barriers will play a key role in advancing statistical data privacy for practical applications and increasing the potential tools for data sharing.

How to Define Privacy?

To tackle a problem requires fully understanding it. When implementing a method to preserve data privacy, first define what “privacy loss” means and how to measure the risk of that loss. Starting at this base level shows how quickly SDC and DP diverge in their approaches to protecting privacy.

One of the strengths of previous SDC methods, which are designed to be intuitive and easily understood, is how they define privacy loss risk. For example, k-anonymity, one of the most commonly used SDC definitions, requires that any released data set contain at least k observations for each combination of possibly identifying variables (such as age, sex, or race) in the data. The data can be altered or suppressed to achieve a certain level of k-anonymity. We understand this definition intuitively, because we know that a higher k implies there are more people in the data with the same characteristics, and identifying someone based on those characteristics would be less likely.
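As a concrete illustration, here is a minimal sketch (not drawn from any particular SDC package; the data frame and quasi-identifiers are hypothetical) of how one might compute the k-anonymity level of a data set by finding the smallest group size across all combinations of the identifying variables:

```python
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the smallest group size over all observed combinations of the
    quasi-identifying variables; the data are k-anonymous for any k up to
    this value."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical toy data: age, sex, and race serve as quasi-identifiers.
df = pd.DataFrame({
    "age":    [34, 34, 34, 51, 51],
    "sex":    ["F", "F", "F", "M", "M"],
    "race":   ["Asian", "Asian", "Asian", "White", "White"],
    "income": [52_000, 61_000, 58_000, 47_000, 90_000],
})
print(k_anonymity_level(df, ["age", "sex", "race"]))  # 2: the smallest group has two records
```

If the resulting value falls below the desired k, records are coarsened or suppressed and the check is repeated.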

Other intuitive SDC definitions of risk rely on the probability of linking external data to the released records or on estimating the population uniqueness of a given record. In these paradigms, the data are suppressed or altered until post-hoc analyses show that the chosen measure of risk is appropriately low.

DP, on the other hand, tosses out all these definitions and starts from scratch. It observes, correctly, that those definitions rely on (1) accurately understanding the way an attacker (or intruder) might try to uncover information in the data and (2) knowing the universe of external data sets that a malicious party would use to aid an attack. Not relying on these assumptions significantly strengthens the privacy guarantee offered by satisfying DP, because DP does not assume an accurate assessment of the attacker’s behavior or what information that attacker might use to extract sensitive information from the data.

These assumptions can easily be wrong, which weakens the actual protection provided by traditional methods. DP, on the other hand, provides levels of protection that can be proved to hold if the definition is properly satisfied. Unfortunately, addressing these challenges comes at the cost of a clear meaning of risk: DP does not have an intuitive interpretation.

DP instead quantifies privacy loss with a parameter, ϵ, which bounds the log of the ratio between the probability that the protection scheme produces any particular output when a given individual is in the data and the probability that it produces the same output when that individual is not. Colloquially, a larger ϵ means the released information is more likely to reveal something about individuals in the data than it would under a smaller ϵ.
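In symbols (notation ours, not from the original article), a randomized mechanism M satisfies ϵ-DP if, for every pair of data sets D and D′ that differ in one individual’s record and every set of possible outputs S,

\[
\Pr[M(D) \in S] \;\le\; e^{\epsilon}\,\Pr[M(D') \in S],
\]

so ϵ bounds the absolute log-ratio of these two probabilities; a smaller ϵ forces the output distributions with and without any one individual to be nearly indistinguishable.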

Furthermore (and a point of some confusion), the data must be protected over the entire data domain to satisfy DP. This replaces the traditional SDC notion of determining which variables in the data might be sensitive and hashing out ways an attacker would try to learn potentially sensitive information. Instead, DP requires that the data be protected assuming any possible individual could be in the data and that a data attacker could be armed with any currently unknown or potential future information, not simply the information currently available to those applying privacy protections.

To accomplish this, researchers must modify DP algorithms to offer protection for any possible individual’s data that fall within the domain of the database. This can be difficult when the domain is not clearly understood or is unbounded, such as for continuous variables with no discernible upper bound. This concept of considering the entire possible domain space is difficult to master quickly and lacks the intuitive interpretation of the older definitions.
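One common way to impose such a bound, sketched below with purely hypothetical choices (the 0-to-500,000 income domain and ϵ = 1 are illustrative, not recommendations), is to clamp each record to a declared range before adding noise, which caps any single individual’s influence on the released statistic:

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Release a differentially private mean by clamping each value to a
    declared [lower, upper] domain (bounding one record's influence) and
    adding Laplace noise scaled to that bound."""
    rng = rng or np.random.default_rng()
    x = np.clip(values, lower, upper)        # enforce the assumed domain
    sensitivity = (upper - lower) / len(x)   # max change from swapping one record
    return x.mean() + rng.laplace(scale=sensitivity / epsilon)

# Hypothetical example: incomes clamped to an assumed $0-$500,000 domain.
incomes = np.array([52_000, 61_000, 58_000, 47_000, 2_000_000])
print(dp_mean(incomes, lower=0, upper=500_000, epsilon=1.0))
```

The clamping step is exactly where the domain assumption bites: bounds set too tightly bias the statistic, while bounds set too loosely require more noise.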

Which is better: the more-intuitive definitions of assessing risk from traditional SDC methods, which require more-strenuous assumptions that may be violated in the future, or a formal definition of privacy from DP, which is harder to both achieve and understand? Ultimately, the differing opinions come down to a policy argument, which highlights the fact that these issues should not be discussed as a purely technical matter.

The interplay between privacy policy and technical privacy ties directly into the natural next question to ask after defining privacy loss.

How Much Privacy Loss to Incur?

To determine privacy loss, traditional SDC methods rely heavily on domain experts or public policymakers to provide the social context of the data. For instance, domain experts or policymakers can indicate which variables or individuals are more sensitive than others, such as tax return status or people above a certain income level. To assess the privacy risks, data privacy researchers and domain experts assume and predict the potential data attacker’s background knowledge and behavior.

Although leveraging expert and public policy knowledge to determine privacy loss makes intuitive sense, some data privacy researchers criticize this approach to assess disclosure risk for being too ad-hoc. This criticism extends to some SDC methods that rely on these measures of privacy loss to protect the data.

In contrast, DP formally quantifies privacy loss as previously discussed, but even so, DP researchers and practitioners continue to debate how to pick an appropriate value for the privacy parameter, ϵ. At least in part, this debate stems from differing views on how to interpret the parameter. Early work suggested that ϵ values less than or equal to 1 were preferable and that values above 2 were highly inadvisable.

This reasoning was based on comparing the differentially private and original data outputs for various statistical inferences at varying values of ϵ, such as point estimates and power in hypothesis testing. However, more-recent practical applications have used much higher values (e.g., 8), and some DP researchers suggest that there is no real upper bound on the parameter. This perspective treats ϵ simply as a “privacy budget,” where a larger value implies that more privacy is “spent” and therefore fewer protections are guaranteed for individuals.

In this framework, choosing ϵ becomes a public policy decision for stakeholders and domain experts, where the amount of privacy spent must be balanced against the social utility of the released information. While this framing is now quite common, the data privacy and confidentiality community lacks sufficient tools and resources to help policymakers understand what choosing one value of ϵ versus another means in reality for individuals.

For instance, if policymakers decide to treat ϵ like a budget, they will need guidance on how to assign a finite privacy budget in a system where researchers can repeatedly query or analyze the data. Specifically, for a method to satisfy DP, every time information (e.g., a statistic or a synthetic data set—data with pseudo records that are statistically representative of the original data) is released based on the data, the DP algorithm “spends” a certain amount of the privacy budget. If there is no cap on the budget, then the privacy loss will continue to accrue each time the data are queried.

Over time, as the privacy budget approaches infinity, the information gleaned from the “protected data” becomes equivalent to publicly releasing the unaltered data.

On the other hand, if the total budget is fixed, it implies that eventually all use of the data must cease once researchers spend the entire budget.
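A minimal sketch of that bookkeeping, assuming basic sequential composition (the total privacy loss is the sum of the ϵ values of all approved releases) and hypothetical numbers:

```python
class PrivacyAccountant:
    """Track cumulative privacy loss under basic sequential composition:
    the total epsilon spent is the sum over all approved releases."""

    def __init__(self, total_budget: float):
        self.total_budget = total_budget
        self.spent = 0.0

    def request(self, epsilon: float) -> bool:
        """Approve a release costing `epsilon` only if it fits in the budget."""
        if self.spent + epsilon > self.total_budget:
            return False              # budget exhausted: refuse the query
        self.spent += epsilon
        return True

# Hypothetical example: a total budget of 1.0 split across three queries.
accountant = PrivacyAccountant(total_budget=1.0)
print(accountant.request(0.25))  # True: 0.25 spent so far
print(accountant.request(0.50))  # True: 0.75 spent so far
print(accountant.request(0.50))  # False: this release would exceed the budget
```

More refined accounting schemes exist, but the basic tension is the same: every approved query draws down a finite resource.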

One potential solution to avoid overspending the privacy budget is to release a fixed synthetic data set and then nothing else or, if using a query system (a mechanism for researchers to submit their analyses), to have the system simply stop functioning once the budget is gone. While this may be possible, researchers and public policymakers would probably demand continued access if the data still existed.

A limited budget raises another challenge: how to allocate the privacy budget among researchers wishing to use the data. Should anyone wanting access be given a privacy budget, or should there be selection criteria? If researchers are selected, how much ϵ should each person receive?

These questions quickly escalate into issues of fairness and equality. As a way to expand the budget and increase equality when working with continually collected data, such as survey results, some DP researchers propose partitioning the data for each researcher requesting access. The data maintainer could replace older partitions whose privacy budgets have been exhausted with more-recently collected partitions (assuming this does not create representation issues).

Another possibility is that, after the budget is exhausted, the data would only be available through restricted access, such as secure data centers. However, this approach raises its own complications concerning publishing any information based on the restricted analyses, and it does not address the fairness and equality issues for researchers who lack the means to gain restricted access, such as those living far from a secure data center.

Altogether, these situations present difficult questions that policymakers must answer to implement DP systems.

How to Communicate Privacy Concepts?

Considering the complexity of defining privacy and determining how much privacy loss to incur, setting appropriate privacy policy requires communicating these ideas clearly and effectively. Materials explaining traditional SDC methods to policymakers include several books, website descriptions, articles, blogs, and more that are geared toward people from a variety of backgrounds, from nontechnical practitioners to experts. This wealth of communication materials stems from both the field having existed for several decades and the relative ease of explaining these methods.

In contrast, for some years after DP was first introduced, very few written communication materials were directed toward those who are not privacy experts or computer scientists. The lack of easily understood DP materials intensified data maintainers’ and practitioners’ resistance to adopting DP over common SDC methods.

More recently, members of the DP community have attempted to create materials to address this need, such as “Differential Privacy: A Primer for a Non-Technical Audience” (Nissim et al., 2017), written with a legal audience in mind. Similarly, after the U.S. Census Bureau and large tech companies such as Google announced their use of DP in their products, a surge of videos, blogs, and general DP communication outreach emerged.

While a step in the right direction, many of these materials still struggle to communicate the ideas of DP in nontechnical terms and ways that researchers of various backgrounds can understand. The primers have largely been framed for a computer science audience or statistical data privacy practitioners and researchers. As a good example of attempting to fill this gap, “Differential Privacy and Social Science: An Urgent Puzzle” in the Harvard Data Science Review addressed social science researchers particularly and focused on framing the problems in terms relevant to policymakers and those likely to use these data.

To aid in communicating these ideas, consistent terminology for the field must be developed. The current inconsistency mostly stems from the diverse research disciplines that intersect with DP, such as statistics, computer science, social science, and economics. To use one small example, what statisticians refer to as a marginal table is often called a histogram in computer science. Currently, organizations such as the National Institute of Standards and Technology (NIST) and the Federal Committee on Statistical Methodology (FCSM) are trying to fill that void by developing stronger standards for communication and tools for privacy researchers.

For instance, FCSM is working on curating and compiling an online data protection toolkit that incorporates DP into the broader SDC picture. This toolkit aims to include “…templates based on best practices for assessing, managing, and mitigating the re-identification risk of individuals or enterprises in U.S. federal data products.” Works such as these enable a common language, which is especially necessary for equipping policymakers and practitioners to apply privacy tools appropriately.

How to Formalize Privacy Tools?

Privacy policy, once appropriately defined and communicated, must have corresponding tools to put it into action. One of the reasons traditional SDC methods gained large-scale adoption was the availability of computational tools that non-experts can easily use. For instance, synthpop in R assists in generating synthetic data and evaluating their utility. If the DP community wants similarly wide acceptance and use of DP methodology, it needs more open-source tools.

Starting in 2020, OpenDP, a research group based at Harvard University, is “…engaging a community of collaborators in academia, industry, and government to build trustworthy, open-source software tools for privacy-protective statistical analysis of sensitive personal data.” Their overarching goal is to create open-source tools to implement DP methods more easily.

As a first step, the group released WhiteNoise, a DP library that contains code (in Rust, Python, and R) to generate and apply DP statistics, mechanisms, and utilities. OpenDP also hosted a workshop, inviting data privacy researchers to generate ideas and solicit feedback about how OpenDP should move forward, with the goal of building broad collaboration.

OpenDP opens an avenue for researchers to start learning to use these privacy tools on their own. A few courses and tutorials on implementing traditional SDC techniques exist, although they are not widely accessible outside statistics. Future tutorials should resemble those for causal reasoning or machine learning, where researchers from a wide range of backgrounds can study best practices and learn common tools. As more researchers become familiar with these methods, the need for data privacy experts to handle every data release will ease, expanding data-sharing possibilities.

Again, more-complex systems will still require experts, but for more-straightforward problems, researchers should be empowered to handle privacy themselves, and have standardized quality assurance as for any other type of data analysis.

What Kinds of Data Present Difficulties for Protecting Privacy?

While choosing appropriate definitions, creating communication materials, developing tools, and setting policy are all vital to the privacy debate, certain technical limitations constrain the types of data that can be safely shared. Even for data privacy experts, striking the balance between ensuring adequate privacy protection and preserving the utility (or usefulness) of the data is extremely difficult for certain types of data.

These issues are not unique to DP. On the contrary, traditional SDC practitioners have generally avoided producing public data for the types of problems elaborated in the following paragraphs and have opted instead for restricted data access. Understanding the natural limitations of statistical data privacy tools on certain types of data helps inform the policy choices made to protect those types of data.

Small populations present an archetypal example of the tricky trade-off between protecting privacy and preserving the utility of data. To help explain, imagine collecting demographic and financial data about all families living within the United States. One goal is to keep the identity of any participating family private, given that financial information is sensitive, but it is also important to keep the demographic and financial representation in the data accurate, for purposes such as determining where to target stimulus relief during economic recessions.

Now, focus on one such family: a Japanese-American family of four in New York City. Since the city is large and racially diverse, the family can easily be “hidden” in the data while the data’s statistical qualities are preserved, such as how many Japanese-Americans live in certain neighborhoods of New York City at a certain income level. However, if the family lived in a sparsely populated state such as Wyoming (the least populous state in the U.S.), where there are very few Asian Americans in general, hiding the Japanese-American family would be much harder.

Favoring privacy, both DP and traditional SDC methods will either significantly alter the representation of the Japanese-American family or remove them entirely from the data (e.g., through suppression). Favoring utility, anyone who can access the data will find exactly (or close to) the number of Japanese-Americans living in Wyoming.

Figure 2. Data representation of five largest ethnic demographics in New York City, New York and Wyoming.
Data collected from DataUSA.io.

This simple example highlights how privacy and utility oppose one another. Both traditional SDC and DP methods tend to either entirely misrepresent or remove the target small population. Traditional SDC methods would be most likely to rely on suppression in this situation and would probably suppress the minority populations. Not suppressing them would leave them exposed to higher re-identification risk.

Because DP strives to protect all possible individuals in the data equally, rare records within the data will be affected the most by alteration, and the released data will probably entirely misrepresent the truth (e.g., the altered, or noisy, data might claim there are dozens of Japanese-American families in a town of 100). Traditional SDC perturbative methods similarly distort small values.
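A small numerical sketch (with made-up counts and a hypothetical ϵ) shows why: a count has sensitivity 1, so the Laplace noise scale does not depend on the size of the count, which makes the noise negligible for a count of 30,000 but potentially overwhelming for a count of 3:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
epsilon = 0.5
scale = 1 / epsilon                   # noise scale for a count (sensitivity 1)

for true_count in (3, 30_000):        # a rare group vs. a large one
    noisy = true_count + rng.laplace(scale=scale, size=5)   # five sample releases
    relative_error = np.abs(noisy - true_count) / true_count
    print(true_count, np.round(noisy, 1), np.round(relative_error, 3))
```

For the small count, the relative error is typically substantial and the noisy value can even go negative, while the large count is essentially unchanged.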

Small populations can be taken to the furthest extreme with personalized medicine, for example. In this case, the population of interest has n = 1, and it is clearly impossible to both protect privacy and study the population of interest. How large a population must be before its information can be both protected and preserved in the data remains an open question.

Other scenarios would make protecting the Japanese-American family very difficult. If their movement throughout the day could be tracked and their demographic information kept within the data, others could easily identify the family and infer their routine. With this information, someone could, for example, determine where the kids go to school, what time they come home, and when they would be alone.

On the flip side, if someone only knew their movements, the Japanese-American family could be identified based on the location data. Clearly, adding location and time to the data causes more difficulties in protecting people’s privacy.

Social media and social network data also present problems by introducing additional information about how people interact with one another. If someone correctly identified the Japanese-American family in social network data, they could potentially discover who else in the data is likely to be of Japanese descent based on how, when, and at what frequency the family interacts with others within the data.

In addition to these concerns over identifying minority groups, typical applications of traditional SDC and DP methods tend to create inequality issues by either heightening privacy risks or providing fewer societal benefits for marginalized individuals. For example, since Wyoming has low racial diversity, any publicly released data with racial demographic information are more likely to create a privacy violation for minorities such as Japanese-Americans than for Caucasians living in the same region.

While data privacy and confidentiality techniques can protect these individuals’ privacy, the misrepresentation of marginalized groups induced by altering the data may keep them from reaping the same societal benefits from the data as majority groups, such as research outcomes or the administration of economic stimulus programs.

Practice Makes Perfect! We Need More Use Cases

Tackling a problem in practice enables understanding it more deeply. While practitioners have implemented SDC for years, only a few DP use cases exist. Although many challenges that both traditional SDC methods and DP face can be perceived and explained, data privacy experts simply have more experience in working with traditional SDC in practice than with DP.

This situation places DP at a disadvantage, because practical applications help practitioners discern real data problems and devise new ways to handle them that theory alone might not reveal.

Because most of the debate between older methods and DP has been academic, many people assume that DP either will or will not work in practice. The reality is more complicated. Many researchers propose theoretically optimal methods that satisfy DP, but those methods may depend on assumptions that often fail with real data.

On the other hand, the assumptions may be satisfied, but the approach is optimal in a way that is irrelevant for the actual data use. For example, theoretically minimizing error bounds on the amount of random noise added to the data does not necessarily translate into better results for the way the data are actually analyzed. Theoretical developments are crucial, but without tying the theory specifically to the way data are analyzed or used, there is a risk of producing methods that do not help in practice.

Emphasizing practical applications also allows drawing in a broad range of disciplines for input. Depending on what data users actually need in practice, methods will have to be created that produce a specific type of output, such as a statistic, a data set, or a visualization. Desired outputs vary significantly among disciplines, so it is necessary to draw in individuals with a wide variety of problems.

On the other hand, the few DP applications that exist are based on highly complex data systems instead of tackling simpler problems first. These use cases require an extensive amount of resources and have faced critiques for their shortcomings. One example is the U.S. Census Bureau’s implementation of DP for the 2020 Decennial Census.

The Decennial Census is a massive undertaking, with significant restrictions about how the data must be produced, and is used by a large number of policymakers, practitioners, and researchers. Many practitioners and researchers who use Census data products have been pushing back against the Census Bureau’s adoption of DP. They fear they will lose essential parts of the data because DP will alter, and in some cases, significantly reduce, the utility of the public data.

These fears are partially well-founded, because stronger privacy protection will naturally result in less information and, in the Decennial Census, small populations in particular will be affected. However, since most practitioners and policymakers do not have experience working with DP data, some of the fear stems from the fact that DP is not well understood in the community.

Other examples include the tech world’s forays into privacy. Google, Apple, and Uber have all implemented DP on some of their products, but they all suffered criticism from the privacy community for failing to adequately meet DP standards. For the most part, these critiques arose because the companies sacrificed privacy to make their products work as desired.

In Google’s case, with Chrome data, the company did not track how much ϵ was spent and could not compute it for the user data. Apple applied a very high value of ϵ for its iPhone emoji texting, and Uber similarly did not track its privacy loss in a way that could be computed.

Highlighting these issues does not discredit these applications in any way. Both the Census and the other entities deserve acknowledgment for taking on these challenging data sets, and many lessons have been learned through them. The U.S. Census Bureau’s application has raised a number of important questions about structural constraints and small populations, and has opened the eyes of many social researchers to the inner workings of privacy protection. Google and Apple pushed forward the idea of DP in deep learning, and Uber worked through privacy issues in the context of SQL.

In spite of these advances, though, focusing on large-scale applications such as these runs two primary risks. First, they are extremely resource-heavy, taking years of work and the efforts of many privacy experts to produce only somewhat-satisfactory results. Second, because these use cases are more difficult, the field lacks a foundation of simple applications that work well.

The focus on more-complex databases has given skeptics of DP fuel to point out the flaws in these efforts as reasons to abandon this privacy definition completely. High-profile applications are necessary to raise awareness but, now that DP is becoming a more-familiar idea, a set of easier use cases spread across different fields is needed. In the same way that traditional protection methods accumulated years of applications that built the SDC field, more use cases will help acclimate data users to analyzing DP data and understanding the limitations it places on their studies.

Looking to the future, statisticians must play a crucial role in meeting these challenges and should view guaranteeing data privacy as a required part of producing high-quality research. The current privacy debate demands more voices from statisticians and more researchers interested in sharing their data to develop practical applications. We should all adopt privacy as part of the standard statistical toolbox, rather than leaving it only to specialists. A lack of adequate involvement from the statistics community risks producing either biased public research data or an overall lack of statistical tools and resources that enable data sharing.

Instead, we should work together with other disciplines to push the statistical data privacy field forward.

Acknowledgment

The authors thank Dr. Brian Vegetabile for providing helpful comments that greatly improved the quality of this article.

Further Reading

Calibrating Noise to Sensitivity in Private Data Analysis.

Differential Privacy: A Primer for a Non-Technical Audience.

Differential Privacy and Social Science: An Urgent Puzzle.

Google COVID-19 Community Mobility Reports.

Lancet, NEJM retract controversial COVID-19 studies based on Surgisphere data.

About the Authors

Joshua Snoke is an associate statistician at the RAND Corporation. His research focuses on statistical data privacy methods for increasing researchers’ access to data that are restricted due to privacy concerns. He has published work on various statistical data privacy topics, such as differential privacy, synthetic data, and privacy-preserving distributed estimation. He serves on the Privacy and Confidentiality Committee of the American Statistical Association and the RAND Human Subjects and Protections Committee. He received his PhD in statistics from the Pennsylvania State University.

Claire McKay Bowen is the lead data scientist of privacy and data security at the Urban Institute. Her research focuses on comparing and evaluating the quality of differentially private data synthesis methods and science communication. After completing her PhD in statistics at the University of Notre Dame, she worked at Los Alamos National Laboratory, where she investigated cosmic ray effects on supercomputers. She is also the recipient of the NSF Graduate Research Fellowship, Microsoft Graduate Women’s Fellowship, and Gertrude M. Cox Scholarship.
