Protecting Privacy and Confidentiality in an Era of Big Data Access

Julia Lane

“Big Brother Is Watching You,” George Orwell, 1984

Orwell’s words ring true. Big Data mean that a new analytical paradigm is open to statisticians and social scientists (see the discussion by Hey, Tansley, & Tolle (2009)). The statistical community has moved beyond survey and even administrative data to begin to understand how data can be mined from social media to capture national sentiment, from cellphone data to understand anti-government uprisings, and from financial data to examine swings in the economy. The funding opportunities are also there, viz. the White House Big Data initiative. But the excitement of being able to access and analyze large amounts of micro-data on human beings should be tempered by a commitment to minimize the threats to individual privacy and confidentiality.

Privacy and confidentiality protections are threatened in this brave new world of data because the traditional role of data producers has become less relevant. This means the standard sets of confidentiality protections that were applied to data collected by statistical agencies, the traditional data producers, are also less relevant. In the old paradigm, trained statisticians produced, curated, and disseminated large-scale surveys. Statutory protections, such as Title 26 and Title 13 of the U.S. Code, provided penalties for breaches of confidentiality, and agencies developed researcher access modalities in accordance with their statutory authorization.

In the new paradigm represented by Big Data, each individual is his or her own data producer, the data are housed in businesses or administrative agencies, and researcher access is largely unregulated. Indeed, while both major funding agencies (the National Science Foundation and the National Institutes of Health) have mandated that funded researchers must provide plans for access to the data they collect (see the NSF website), there is no specific guidance for protecting confidentiality. There is a plethora of examples of how naïve releases of data can lead to re-identification, and it is clear that breaches of confidentiality that result from the actions of one researcher affect the ability of scientists everywhere to collect and use data. And preserving access to high-quality scientific data is essential to the empirical replication that is at the core of good science.

To mix metaphors, the current state of Big Data for the research community is still the Wild West. As order and analytical rigor are brought to the new data frontier, we should make sure that the resulting structure achieves the goal of good science while protecting confidentiality. The two key features that were embodied in statistical agencies were (i) institutional structures that provided access to curated data to promote replication of analysis and (ii) trained ‘data scientists’ who were able to develop statistical and technical approaches to reduce the risk of re-identification (for details, see Statistical Confidentiality: Principles and Practice). These features could be reproduced in the United States; the appropriate infrastructure support should be provided by the very funding agencies that support the creation and mandate the dissemination of data.

Why Data Access Matters

The creation and analysis of high-quality information are core elements of the scientific endeavor. No less fundamental is the dissemination of such data, for many reasons. The first is that data only have utility if they are used. Data utility is a function of both the data quality and the number and quality of the data analysts. The second is replicability. It is imperative that other researchers be able to replicate and validate scientific analysis. The third is communication. Social behavior is complex and subject to multiple interpretations: the concrete application of scientific concepts must be transparently communicated through shared code and metadata documentation. The fourth is efficiency. Data are expensive to collect (the U.S. 2010 Census alone cost over $13 billion), so expanding their use, promoting repurposing, and minimizing duplication is fiscally responsible. The fifth is capacity-building. Junior researchers, policy makers, and practitioners need to have the capacity to go beyond examining tables and graphs and develop their understanding of the complex response of humans to rapidly changing social and legal environments. Access to micro-data provides an essential platform for evidence-based decision making. Finally, access to micro-data permits researchers to examine outliers in human and economic behavior, which are often the basis for the most provocative analysis.

These arguments are not simply theoretical. The value added by access to micro-data is confirmed empirically. Illustrative examples include the deeper understanding of business dynamics made possible by examining the contribution of firm births and deaths, as well as expansions and contractions, to net employment growth. Similarly, a landmark 1954 study of survey data on doctors’ smoking habits matched to administrative data on their eventual cause of death was critical in establishing the link between smoking and both cancer and coronary thrombosis (for more discussion on these issues, see the proposal by the International Data Forum). And data on individual financial transactions are now routinely used to model and limit losses due to defaults on loans.

Institutional Structures

There are many ways in which data can be curated and made accessible, but they require infrastructure support. European researchers are particularly fortunate to have infrastructure investments supporting social science research with such institutions as CESSDA (Council of European Social Science Data Archives) and DASISH (Data Service Infrastructure for the Social Sciences and Humanities). In the United Kingdom, the UK Data Service “manages the research data lifecycle to ensure an ongoing process of data curation, research and use.” A recent Global Science Forum report by Elias & Entwisle (2012) has recommended the following:

“National research funding agencies should collaborate internationally to provide resources for researchers to assess the research potential of new forms of data to address important research areas.

National research funding agencies should collaborate internationally to help specify and provide resources to develop new methods to understand the opportunities and limitations offered by new forms of data.”

What this means in practice is that data dissemination, which is inherently a public good, can be institutionalized so that best practices can be identified. These best practices would ideally go beyond the social science community and reach into the broader research community, including the extensive work done by computer scientists (see, for example, the research done under NSF’s Secure and Trustworthy Cyberspace initiative) to protect defense-related and financial data. Indeed, the technological factors that have led to explosive growth both in the capability of providing such data and in the capability of data snoopers to breach confidentiality are in the domain of computer scientists as well as social scientists. Engaged researchers could be tasked to initiate and nurture a Web 2.0 style community of practice that focuses on promoting secure means of access to sensitive data. The best practices might include a set of citation standards for data that could be adopted by research foundations and required of grantees to protect intellectual property rights and promote the acknowledgement of researcher contributions to data development; for a rich set of discussions and papers, see Victoria Stodden’s research page.

In the United States, then, the requirement by NSF and NIH to disseminate research data should be accompanied by a commitment by the agencies to professionally support that dissemination. Just as, for example, NSF supported supercomputer centers to promote high-end computing, NSF could support professional data repositories and professional communities that would act as vehicles for data dissemination and the development of workflow tools.

Developing ‘Data Scientists’

An eloquent description of statistical confidentiality is “the stewardship of data to be used for statistical purposes,” according to Duncan et al. (2011). Statistical agencies have been at the forefront of developing that stewardship community in a number of ways. First, on-the-job training is provided to statistical agency employees. Second, in the United States, academic programs such as the Joint Program in Survey Methodology, communities such as the Federal Committee on Statistical Methodology, and resources such as the Committee on National Statistics have been largely supported by the federal statistical community. But the focus is almost exclusively on developing methodologies to improve the analytical use of survey data and, to a lesser extent, administrative data. Nothing similar exists to train scientists in developing an understanding of such issues as identifying the relevant population and linkage methodologies. Such training is essential not only because it is important to draw analytical inferences, but also because disclosure limitation relies heavily on an understanding of how many individuals in the population have a particular set of characteristics. “Data scientists” in the context of Big Data might be broadly defined. They might include cryptographers interested in differential privacy such as Cynthia Dwork, computational scientists (see Allen, B. et al. (2011)), and statisticians. And there is precedent for the support of workforces in just these areas; for example, the NSF CI-TEAM solicitation supported the development of a cyberinfrastructure workforce.
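The point that disclosure limitation depends on how many individuals share a particular set of characteristics can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the records, the three characteristics (age band, three-digit ZIP prefix, sex), the threshold k, and the function name risky_combinations are all hypothetical, and the toy list stands in for a full population, which a real data steward would rarely have in hand.

```python
from collections import Counter

# Hypothetical toy "population": (age_band, zip3, sex). All values are made up.
records = [
    ("30-39", "200", "F"),
    ("30-39", "200", "F"),
    ("30-39", "200", "M"),
    ("40-49", "201", "F"),
    ("40-49", "201", "F"),
    ("40-49", "201", "F"),
    ("50-59", "202", "M"),
]


def risky_combinations(rows, k):
    """Return each combination of characteristics shared by fewer than k rows."""
    counts = Counter(rows)  # how many individuals have each exact combination
    return {combo: n for combo, n in counts.items() if n < k}


# Combinations held by fewer than 2 individuals are unique in this toy
# population, so releasing them unaltered would carry re-identification risk.
flagged = risky_combinations(records, k=2)
for combo, n in sorted(flagged.items()):
    print(combo, n)
```

A steward who saw combinations flagged this way might coarsen a variable (for example, widening the age bands) or suppress the affected records before release; the choice of threshold is a policy judgment that this sketch does not settle.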

Summary

In sum, the brave new world of data has created new demands for infrastructures that will both disseminate data and protect confidentiality. Funding agencies such as NSF have issued important calls, like the recent “Integrative Data Management” call, but long-term support and management of data are necessary if we are to realize the analytical and scientific promise of Big Data without Orwellian violations of the confidentiality of information about human beings.