Making Confidential Data Part of Reproducible Research

The rise of data-centric research practices has uncovered shortcomings in the traditional scholarly communication system. The foundation of that system, the peer-reviewed publication—“[the] selective distribution of ink on paper, or… electronic facsimiles of the same,” according to Bourne, et al. (2011)—does not adequately support what has become an essential element of scholarship: the reproducibility of research results.

Reproducibility refers to duplicating a reported result with the data, tools, techniques, etc., used in the original research. The notion of reproducible research is appealing for a number of reasons, including facilitating novel research that “builds on the shoulders of giants,” allowing the veracity of existing research to be tested, and educating new scholars in common research practices.

Reproducibility depends on disaggregating and exposing the multiple components of the research—data, software, workflows, and provenance—to other researchers and providing adequate metadata to make these components usable.

The belief in the importance of reproducibility is shared by a large number of scientists, in many disciplines, who have pushed ever more strongly for a modernization of the current system of scholarly communication, including support for reproducibility. In particular, we refer to the set of recommendations articulated by Stodden, et al. (2016), which they call “Reproducibility Enhancement Principles” (REP).

This column focuses on issues of confidentiality, which is intimately linked to reproducibility. There has been considerable concern in academic circles about a perceived lack of reproducibility of studies that are based on “proprietary,” “confidential,” or “administrative” data, terms that are often conflated but that do not, in fact, describe the same type of data. The key worry is access: The authors of a study that uses confidential data cannot themselves deposit the data with the journal, thereby impairing easy access to those data and consequently impeding reproducibility.

The conclusion, often heard at conferences, is that confidential data cannot be part of the scientific process. We beg to disagree with that blanket statement.

In this column, we address issues surrounding the reproducibility of confidential data held by national statistical offices (NSOs). In the United States, these might be the U.S. Census Bureau, Bureau of Labor Statistics, or Energy Information Administration, but similar research data access is possible in Canada, Germany, France, Norway, etc. We do not address issues of confidential data stemming from sub-national administrations (states, counties, cities, school districts) or from private companies (individual company data, or data on other entities provided by privately held companies, such as social media data). The key distinction between NSOs and these other entities is how well the owner curates and manages access to the data.

There are real issues with sub-national administrations and private companies that cannot be handled easily. We argue, though, that data held by NSOs do have attributes that lend themselves to reproducibility exercises, although this may, at present, not always be communicated correctly.

How significant is this body of research? Based on our analysis of one particular journal (American Economic Journal: Applied Economics, 2009–2013), studies that use data held by NSOs account for about half of all studies using confidential data.

While some groups propose ambitious, long-term transformations in scholarly communication to achieve large-scale reproducibility, our goal is to describe some “low-hanging fruit” with which we can achieve a measure of scientific reproducibility for this large body of scholarly work based on confidential data in the custody of NSOs. Such data can, in fact, be part of a reproducible scientific endeavor, although there are potholes and bumps on the path to achieving that goal.

Our modest “proposal” (a series of suggestions) leverages existing tools and practices. It promotes reproducibility through a number of components:

  • Descriptions of methodology, which already exist in the traditional publication framework.
  • Naming systems, which make it possible to assign unique identifiers to the components of research: data, publications, software artifacts, etc.
  • Archived data, which in all cases are citable by other researchers and, in cases where the data are properly anonymized, are retrievable.
  • Metadata, which describes the structure (variables) of data used in the research. In cases of restricted micro-data, this metadata may be partially cloaked to hide protected variables and values.
  • Template language, which can be used by researchers in their publications to highlight availability of data, proper data citation, and caveats to the data.

Important components of these methods are an institutional commitment to maintain such practices, and easy-to-use tools and templates for researchers to leverage.

Data Access is Key

The key to making confidential data part of reproducible research is having mechanisms that facilitate non-exclusive access. Most NSOs already have mechanisms in place that ensure data access is not exclusive to the original researcher. Such access is not, however, open to just anybody.

We might complain that it is not feasible for the authors of this column (both of whom are Americans) to reproduce a study that uses Norwegian data, because data access there might require citizenship as a condition. It is also true that to access U.S. Census Bureau data, a minimum period of in-country residence is required. However, in both cases, many hundreds, if not thousands, of other researchers are qualified and able to access the data.

All is well, then? Not necessarily.

Application protocols vary across countries, and within countries, across agencies, sometimes across data sets. Often, information about the application process is obscure or complicated, and sometimes, requesting access is limited to certain “call for proposals” deadlines or other infrequent and limited time periods.

The Environment

A second key factor is that most access to NSO-managed confidential data occurs within carefully controlled, restricted environments. These lend themselves particularly well to implementing reproducible research. We illustrate this using the process in the Federal Statistical Research Data Centers (FSRDCs), but similar processes are used in most restricted-access environments in the U.S. or abroad.

After completing the FSRDC-resident research, researchers must submit a Request for Clearance of Research Output. This request gives the legal curator of the data the basis on which to review the outputs of the research and determine whether these can be safely released publicly. This review verifies that the results are sufficiently anonymized to conform to the confidentiality mandates to which the data are subject.

The form asks for three key elements: the input data used for the research, the nature of the output files, and the analysis programs used to transform the input data into the output files, all of which are scrutinized to ensure that the outputs protect confidentiality. Disclosure protection is specified down to the variable level.
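
Concretely, a clearance request therefore bundles something like the following minimal sketch; the field names and file names here are hypothetical illustrations of ours, not the layout of the actual form:

    # Hypothetical, minimal manifest of the three elements an FSRDC clearance
    # request asks for; field and file names are illustrative only.
    clearance_request = {
        "input_data": [
            {"dataset": "confidential micro-data used in the analysis",
             "variables_used": ["earnings", "industry"]},  # protection is specified per variable
        ],
        "analysis_programs": ["01_prepare.do", "02_estimate.do"],  # code turning inputs into outputs
        "output_files": ["table1_regressions.txt"],  # results requested for public release
    }
    print(clearance_request)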

Those are the minimal elements required to “enable independent regeneration of computation results […] data, computational steps that produced the findings, and the workflow describing how to generate the results using the data and code,” as recommended by Stodden, et al. (2016). Thus, by its very nature, the restricted-access environment obliges the researcher to comply with reproducibility requirements—and, in fact, in the case of the FSRDC, has been doing so for over 20 years!

Again, all is well now? We don’t think so.

While all the key elements are present, they are hard to leverage for the average researcher. Even without the use of unique identifiers (which we will get to), most researchers do not communicate these elements to peers and journals, or if they do, do so in a highly inconsistent fashion.

One of us has informally surveyed over 100 authors of published articles about access to the data used in their studies. Of those who responded (less than half), few could adequately describe the access protocols for the data they had used. While part of the blame must reside with the researchers themselves, the NSOs granting access to the data must shoulder part of it as well, because the citable documentation they provide on those key elements is either nonexistent or inconsistent.

Even when researchers post pre-publication working papers in NSO-managed archives (a frequent practice in economics, where publication lags are long), the data description is idiosyncratic, and non-compliant with any of several modern data citation standards. In part, this is due to the fact that, with rare exceptions, there is no systematic, referenceable catalog for the confidential data. To the best of our knowledge, no restricted-access data center network provides a way to reference the workflow (programs) and its outputs (disclosable results).

Identifiers: A Requirement of a Reproducible Research Environment

To facilitate and encourage the reproducibility of scholarly results, all the entities involved in the process of producing those results—people, publications, data, and computational artifacts—should be exposed, and the relationships among them expressed. Identification is the foundation for making this possible, enabling the citation and, ideally, the retrieval of information (metadata) about an information object. Identifiers should be globally unique, machine actionable and human usable, and time-sensitive for dynamic data. In particular, identifiers should be persistent.

Finally, identifiers should be “metadata aware”: A data identifier should resolve not only to a particular data set, but also to the metadata associated with it, an especially important point for confidential data. In human terms, this may be achieved by the identifier resolving to a readable “splash page.” In general, the commonly used URLs of the World Wide Web are not adequate. The most frequently used identifier schema for publications, data, and other artifacts is the digital object identifier (DOI).

The DataCite initiative has taken the lead in data identification, organizing services to mint DOIs for data sets, associate basic metadata with these named data sets, provide search services for distributed data, track data use, and serve other functions.
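
As an illustration of what a “metadata aware” identifier offers in practice, the short Python sketch below asks the public DataCite REST API for the record behind a DOI and extracts a few basic fields. The endpoint and attribute names reflect the public DataCite API as we understand it, and the DOI shown is a placeholder; treat this as a sketch, not a definitive client.

    import requests

    def datacite_metadata(doi):
        """Fetch the public metadata record that DataCite holds for a DOI."""
        resp = requests.get(f"https://api.datacite.org/dois/{doi}",
                            headers={"Accept": "application/vnd.api+json"},
                            timeout=30)
        resp.raise_for_status()
        attrs = resp.json()["data"]["attributes"]
        return {
            "title": attrs["titles"][0]["title"] if attrs.get("titles") else None,
            "publisher": attrs.get("publisher"),
            "year": attrs.get("publicationYear"),
            "landing_page": attrs.get("url"),  # where the DOI resolves, e.g., a splash page
        }

    if __name__ == "__main__":
        # Placeholder DOI; substitute the DOI of the data set of interest.
        print(datacite_metadata("10.0000/example-doi"))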

Recommendations

Institutional commitment

A key to any of the proposed suggestions, whether they involve simple procedures or more-complex mechanisms and infrastructure, is institutional commitment. Policies and procedures must be implemented, committed to, and managed in a persistent and transparent fashion.

For instance, the institutional infrastructure that supports DOI landing pages may change radically within even a short timeframe, breaking the relationship between a data set and the DOI registered for it in an earlier period. There has to be institutional commitment to ensure that such changes are propagated to the DOI registrar in a timely fashion, so that the DOI continues to resolve to the object's current location. Commitment does not necessarily imply monetary expense, but rather a high-level promise to engage with these mechanisms as a matter of policy.

Suggestion: Research centers of NSOs should commit to maintaining policies that support reproducible research consistently and persistently.

Transparency of Application Process

We noted earlier that access protocols may be ill-defined, depend on time-sensitive application windows, and provide little public guidance on the expected duration of application procedures.

Suggestion: Applications for access to confidential data should be allowed continuously, or regularly and often, such as monthly or quarterly. The process should be transparent and predictable.

Predictability does not imply homogeneity. When access requests must be reviewed by k separate agencies, the recurring joke is that the time for approval is ae^(bk), that is, exponential in the number of reviewing agencies. As long as review periods are reasonable, predictability is key. Over the long term, a transparent, centralized, multi-stakeholder application tracking process should be built, as is being considered in France.

The process could be managed by the NSO, by a grant agency, or by a third party, using open-access APIs to track a uniquely identified proposal through well-defined stages of approval and review. This might provide confidence to prospective researchers, as well as journal editors, that the process is transparent and reasonably efficient in moving from proposal to approval stage.
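
To make the idea concrete, here is a minimal, purely hypothetical sketch of what one record in such a tracking feed might look like; the stage names, identifier format, and fields are our own assumptions, not any agency's actual schema.

    from dataclasses import dataclass, field
    from datetime import date
    from enum import Enum

    class Stage(Enum):
        # Hypothetical review stages; real stages would vary by agency and country.
        SUBMITTED = "submitted"
        AGENCY_REVIEW = "under agency review"
        PRIVACY_REVIEW = "under legal/privacy review"
        APPROVED = "approved"
        REJECTED = "rejected"

    @dataclass
    class ProposalStatus:
        """One entry in a public, machine-readable proposal-tracking feed."""
        proposal_id: str                 # unique identifier assigned at submission
        stage: Stage                     # current stage of approval and review
        updated: date                    # date of the most recent status change
        reviewing_agencies: list = field(default_factory=list)

    # Example record such an API might expose (all values are invented).
    status = ProposalStatus(
        proposal_id="proposal-2017-0042",
        stage=Stage.AGENCY_REVIEW,
        updated=date(2017, 3, 1),
        reviewing_agencies=["national statistical office", "partner agency"],
    )
    print(status)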

Provide authoritative citations for existing objects and access procedures

In almost all NSO-managed environments, a researcher will receive, at some point in the process, an object from the secure system—typically model-based statistics (regression results), usually by way of a disclosure review board or a privacy officer. We suggest that each such release should contain machine- and human-readable metadata on all the relevant objects, including a standard citation for the input data, in the form of both standardized language and a full data citation. The German Institute for Employment Research, for instance, provides an example of the former:

This study uses the weakly anonymous Establishment History Panel (Years YYYY–YYYY). Data access was provided via on-site use at the Research Data Centre (FDZ) of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB) and/or remote data access.

An example of a data citation is:

U.S. Census Bureau. 2014. Geo-coded Address List (GAL) in LEHD Infrastructure, S2011 Version [computer file]. Washington, DC: U.S. Census Bureau, Center for Economic Studies, Research Data Centers [distributor].
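
The same citation could accompany the release in machine-readable form as well. The sketch below renders it as a small set of structured fields loosely modeled on the DataCite metadata kernel; the exact schema an NSO would adopt is an open choice, and the field names here are only an assumption.

    import json

    # The GAL citation above, expressed as structured, machine-readable metadata.
    # Field names are loosely modeled on the DataCite kernel; any agreed schema would do.
    gal_citation = {
        "creator": "U.S. Census Bureau",
        "publicationYear": 2014,
        "title": "Geo-coded Address List (GAL) in LEHD Infrastructure",
        "version": "S2011",
        "resourceType": "Dataset [computer file]",
        "publisher": "U.S. Census Bureau, Center for Economic Studies, "
                     "Research Data Centers [distributor]",
        "publisherLocation": "Washington, DC",
    }

    print(json.dumps(gal_citation, indent=2))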

In addition to citing the data, the method of access should be described. A standard statement should briefly summarize what conditions must be met (if any) to qualify for access to the data, with pointers to more complete descriptions. For example, the following statement might be useful:

The data in this article are confidential and only accessible within the Federal Statistical Research Data Center network to qualified researchers on approved projects. Qualified researchers include most researchers affiliated with a U.S. academic institution. Approved projects are legally required, among other things, to provide benefits to programs of the [data-providing agency]; require non-public data; and pose no risk of disclosure. More information is available online.

These statements can be provided by authors to journals, and cited by authors in footnotes, data access descriptions, etc., as appropriate for each publication outlet.

Finally, all managed computer systems are associated with archive facilities. By providing researchers with sufficient information to recover programs from archives, NSOs can make intermediate data, programs, and workflow documentation traceable.

Ideally, of course, programs and workflow documentation are themselves not confidential, and should be provided to researchers as part of the release of results. Nevertheless, by providing a citable location, NSOs can cover those scenarios where intermediate programs are too complex and costly to analyze for disclosure risks, and lend additional credibility to the claim that the programs provided to journals are, in fact, the programs that were used to produce the results.

Suggestion: Object citations and access descriptions should be provided as part of every release of results by NSOs, customized to the project that is requesting release of such results. (Providing them is inexpensive, and they will go a long way toward improving the perceived reproducibility of the research.)

By doing this, the NSO makes it easy for researchers to give proper credit for shared digital objects, as noted in the third recommendation.

Consistent use of persistent identifiers for all the components of the research process

We argued that unique identification of information resources is a necessary precursor for their reusability. Unique identification hinges on having “naming policies” in place, and procedures to implement them. To be useful, these identifiers have to be public, but they do not have to use DOIs; they can rely on existing identifier systems. Converting idiosyncratic identifier systems to DOIs at a later stage is easy, and should certainly be included in any long-term plan, but much mileage can be gained from implementing some standardized identifier system right away.

Critically, a “landing page” for each object—a dedicated, referenceable page of an online catalog—has to exist, with suggested citations.

Using the FSRDCs as an example, all data sets and projects are tracked by a formal internal project or content management system (CMS). A data set might be referenced as “cmsd00035,” and a project as “cmsp000538.” These identifiers should handle versioning, where possible—for instance, “cmsd00035v2” would indicate a second released version of data set 35.

Once a naming policy is implemented, it becomes relatively straightforward to provide landing pages for all such objects, e.g., http://rdc.nso.gov/cmsd00035.
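
Assuming a naming convention of that shape (the prefixes, identifiers, and base URL below are taken from the examples above; the parsing rules are our own guess), a few lines of code suffice to validate an identifier and map it to its landing page:

    import re

    # Pattern for the identifiers sketched above: "cms" + object type ("d" for a
    # data set, "p" for a project) + a numeric id + an optional version tag.
    ID_PATTERN = re.compile(r"^cms(?P<kind>[dp])(?P<num>\d+)(?:v(?P<version>\d+))?$")

    BASE_URL = "http://rdc.nso.gov"  # base URL from the example landing page above

    def landing_page(identifier: str) -> str:
        """Validate an identifier against the naming schema and return its landing page URL."""
        if ID_PATTERN.match(identifier) is None:
            raise ValueError(f"not a recognized identifier: {identifier!r}")
        return f"{BASE_URL}/{identifier}"

    # "cmsd00035v2" denotes the second released version of data set 35.
    for ident in ("cmsd00035", "cmsd00035v2", "cmsp000538"):
        print(ident, "->", landing_page(ident))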

While most data archives (e.g., ICPSR) use such standardized URLs, most NSOs that we are aware of do not. In particular, project-related splash pages do not exist. Nevertheless, it would seem straightforward to publish some details for projects, such as titles and abstracts, as some FSRDCs already do in newsletters and the like.

Suggestion: NSOs should implement a naming schema covering all objects related to the research workflow, publish the details for the naming schema, and create landing pages for all such objects.

Following this suggestion effectively puts in place all of the conditions for assigning DOIs. The long-term goal should be assigning DOIs to all these objects, and thus achieving the second recommendation.

Benefits

Once global identifiers permeate the system, it is easy to use the existing facilities of CrossRef, DataCite, and other ongoing projects to construct citation impacts for the data used, and for the papers created based on the data. Researchers can obtain credit for the digital objects they create. NSOs and funding agencies can measure the impact on research and policy in an objective manner.

Infrastructure to this end already exists in Europe: OpenAIRE. Researchers who receive funding from NSF, NIH, etc., obtain citable objects for grant reporting mechanisms and can prove compliance with data management plans. With properly identified data, replicable processes, and predictable access mechanisms, it becomes possible to conceive of replication challenges.

Conclusion

We have outlined a number of steps that national statistical offices and their associated research centers could undertake to improve actual and perceived reproducibility of research that leverages data under their control. Some of these steps are very easy to take and could be implemented quite quickly. We take no credit for coming up with the examples; for each process we suggest, we are aware of at least one NSO that is actively following that process. However, no NSO, to our knowledge, implements all of the suggestions.

Additional steps are required beyond these suggestions, such as comprehensive documentation of all digital objects (the fourth recommendation), but taking these initial steps is critical for starting down that path.

Note: While this column mentions the U.S. Census Bureau several times, any opinions and conclusions expressed herein are those of the authors and do not necessarily represent the views of the Census Bureau or the other statistical agencies mentioned here.

The authors’ work was supported by NSF grant #1131848 (NCRN) and by a grant from the Alfred P. Sloan Foundation.

Further Reading

Abowd, J.M., Vilhuber, L., and Block, W. 2012. A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs. In J. Domingo-Ferrer and I. Tinnirello (eds.), Privacy in Statistical Databases (Lecture Notes in Computer Science 7556), pp. 216–225. Berlin/Heidelberg, Germany: Springer.

Altman, M., Arnaud, E., Borgman, C., Callaghan, S., Brase, J., Carpenter, T., and Socha, Y. (Editor). 2013. Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for Data Citation. Data Science Journal, 12, pp. 1–75.

Bourne, P.E., Clark, T., Dale, R., de Waard, A., Herman, I., Hovy, E., and Shotton, D. 2011. FORCE11 MANIFESTO. FORCE11: The Future of Research Communications and e-Scholarship.

ICPSR. 2016. Citing Data.

Stodden, V., McNutt, M., Bailey, D.H., Deelman, E., Gil, Y., Hanson, B., and Taufer, M. 2016. Enhancing reproducibility for computational methods. Science 354(6317), pp. 1240–1241.

About the Authors

Carl Lagoze is an associate professor at the University of Michigan School of Information. He received his PhD in information science from Cornell University. The overarching theme of his research for the past two decades has been interoperability of information systems, spanning the full spectrum of technical and human components that are critical to creating networked information systems that work. The primary thread of his research explores information systems to support scholarship and knowledge production. His work has been widely used in areas such as metadata harvesting, ontology definition, and repository architecture.

Lars Vilhuber holds a PhD in economics from the Université de Montréal, Montreal, Canada, and studied economics at the Universität Bonn, Germany, and Fernuniversität Hagen, Germany. His research interests lie in the dynamics of the labor market. His research in statistical disclosure limitation issues is a direct consequence of his interest in making data available in a multitude of formats to the broadest possible audience. He is presently on the faculty of the Department of Economics at Cornell University, executive director of Cornell’s School of Industrial and Labor Relations’ (ILR) Labor Dynamics Institute, and a senior research associate at the ILR School at Cornell.


O Privacy, Where Art Thou? takes a statistical look at data privacy and confidentiality. If you are interested in submitting an article, please contact Aleksandra Slavkovic, column editor, at sesa@stat.psu.edu.
