Genomics and Privacy
Genomics is not becoming the next significant challenge for privacy; it already is that challenge, and how we address it will have a global impact.
Agencies around the world have put considerable effort into collecting clinical and genomic data and into building large databases to support a variety of personalized health-care initiatives. One of the most important is the Database of Genotypes and Phenotypes (dbGaP) at the U.S. National Library of Medicine.
Genome-wide association studies (GWAS), in particular, reap great benefits from rapid developments in high-quality genomic data collection. According to a report by the National Human Genome Research Institute, the total number of published genome studies rose from fewer than 50 in 2005 to around 900 in 2010 and almost 2,000 in 2013.
The increased collection and sharing of sensitive genomic information has raised significant societal issues, including concerns over individual privacy and confidentiality. In 2008, a paper by Nils Homer and a team of fellow researchers raised concerns that individuals can be easily identified from their genomic data, even when that data is disseminated in aggregated form. In response, genomic data-curating agencies including the National Institutes of Health (NIH), the Wellcome Trust, and the Broad Institute not only removed genomic summaries of case and control cohorts from public access, but also instituted stricter policies on granting access to genomic summaries. The NIH's restricted-access policy remains in effect today. J. Couzin discusses these issues in a 2008 Science article (see Further Reading).
Genomic Data Privacy Risks
Tightened genomic data access policies set in motion two opposing movements in the genetics research community: (1) researchers studying potential privacy breaches associated with the controlled release of genomic data; and (2) researchers advocating full access to personal genomic data. While it is clear that individual-level genetic data deserve a high level of protection, for many years researchers believed that releasing aggregated statistics from thousands of individuals in a GWAS would not compromise the genetic-study participants’ privacy. That belief was challenged in 2008 when Homer and his fellow researchers found that one can combine minor allele frequencies published in a GWAS with genetic data from publicly available sources, such as single-nucleotide polymorphism data from the HapMap project, to determine whether an individual has participated in that particular GWAS.
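The idea behind this kind of membership test can be sketched in a few lines. The version below is a simplified illustration, not the full procedure from the 2008 paper: for each SNP it compares how far a person's allele frequency (0, 0.5, or 1) lies from a public reference frequency and from the published study frequency, then averages the differences. The function name and the toy numbers are hypothetical, and a real analysis would apply a formal test over many thousands of SNPs rather than look at the mean alone.

```python
def homer_distance(genotype, pool_freqs, reference_freqs):
    """Mean of |y - reference| - |y - pool| across SNPs.

    genotype        : one person's allele frequency per SNP (0.0, 0.5, or 1.0)
    pool_freqs      : allele frequencies published for the study cohort
    reference_freqs : allele frequencies from a public reference panel

    A clearly positive value means the person looks more like the study
    cohort than like the reference population.
    """
    if not (len(genotype) == len(pool_freqs) == len(reference_freqs)):
        raise ValueError("all inputs must cover the same SNPs")
    if not genotype:
        raise ValueError("at least one SNP is required")
    diffs = [
        abs(y - ref) - abs(y - pool)
        for y, pool, ref in zip(genotype, pool_freqs, reference_freqs)
    ]
    return sum(diffs) / len(diffs)


# Toy data for illustration only (four SNPs, made-up frequencies).
person = [0.0, 0.5, 1.0, 0.5]
study_pool = [0.1, 0.5, 0.9, 0.4]
reference = [0.3, 0.3, 0.6, 0.2]

score = homer_distance(person, study_pool, reference)
print("closer to study cohort" if score > 0 else "closer to reference")
```

With only a handful of SNPs the statistic is too noisy to mean anything; the attack becomes powerful because a GWAS publishes frequencies for a very large number of SNPs, so small per-SNP differences accumulate.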
Many publications have been devoted to identifying other potential privacy breaches since then, showing how genomic information stored in aggregate form can be exploited through minor allele frequencies. For example, D. Clayton proposed a Bayesian approach to testing the membership status of an individual in a particular sample, and Thomas Lumley and Kenneth Rice used regression results to predict a study participant’s disease status. Some researchers harness genomic summaries to impute and recover individual-level data. We have also seen breaches that take advantage of the linkage between genomic data and metadata. For an overview of these and other strategies researchers have discovered for determining the identities of individuals in aggregate genomic data, see the article in Further Reading by F. Yu et al. in the Journal of Biomedical Informatics. On the other hand, some researchers, such as those associated with the Personal Genome Project, advocate full openness of personal genomic data, accepting that privacy in this setting may not be feasible.
In July 2013, more than 70 leading medical and research organizations from around the world, including the NIH and the Wellcome Trust, declared their intent to form a global alliance to build a framework for sharing the genomic and clinical data they collect from genomic study participants. It is therefore important and timely to understand the underlying privacy and confidentiality risks of genomic data sharing, as well as the methods in statistics and computer science that could enable sharing of usable genomic data while minimizing disclosure risk.
Privacy Protection and Data Sharing
Researchers have started thinking about how to provide privacy protection while preventing linkage attacks on genomic data. One of the main difficulties of privacy protection is that it is almost impossible to control the auxiliary information available to an attacker. Many attacks on genomic data rely on strong correlations between released databases and publicly available data sets, and on the sparseness and high dimensionality of genomic data. While such correlations are often explicitly masked under the regulations that genomic data-curating agencies follow (e.g., the HIPAA Privacy Rule), they can still be revealed in varying degrees by auxiliary information.
Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to genomic databases, especially in terms of guarantees regarding protection from linkage to external information. For example, in a recent study in Science, researchers were able to identify surnames of participants by linking their genetic data with recreational genetic genealogy databases. Thus simply removing identifying information is not enough, and we need more careful methods of protecting privacy; see the article by Gymrek et al. in Further Reading for more details.
More recent research on privacy tries to account for the unforeseeable availability of auxiliary information by being very precise about what kind of privacy guarantees can be offered. Among the various privacy protection approaches, differential privacy is quickly becoming a widely accepted model. Suppose that a person is considering participating in a study that sequences her genetic data and worries that this data or the results of the study could somehow be used against her, for example to deny her insurance. Differential privacy tries to alleviate the concerns of such a participant. Any analysis carried out under differential privacy comes with the guarantee that, if you decide to take part in the study, an intruder (e.g., an insurance company) will learn almost nothing more about you than it could have learned had you not taken part.
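This guarantee has a standard formal statement in the differential privacy literature (see Dwork and Smith in Further Reading). The notation below is introduced here for illustration and is not used elsewhere in this column: M is a randomized analysis, D and D' are any two data sets that differ only in one person's record, S is any set of possible outputs, and ε > 0 is the privacy parameter.

```latex
\Pr\bigl[\, M(D) \in S \,\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[\, M(D') \in S \,\bigr]
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S .
```

When ε is small, the distribution of results is nearly the same whether or not any one person's record is included, which is what limits what an intruder can infer from that person's participation.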
Such a strong privacy guarantee is not easy to provide, of course, and it may come at a serious price in terms of data utility. Researchers are working on developing differentially private algorithms for genomic data sharing. For example, Uhler et al., Yu et al., and Johnson and Shmatikov (see Further Reading) were the first to propose differentially private algorithms for releasing the SNPs that are most strongly associated with a phenotype (e.g., a disease), a task commonly carried out in genome-wide association studies. Recent extensions also cover the release of coefficients of penalized logistic regressions in this setting.
These algorithms have been applied to real human GWAS data sets and evaluated by analyzing the trade-off between privacy protection and statistical utility of the released data. They show promise for supporting broader sharing of genomic data with rigorous privacy guarantees.
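To give a flavor of what such an algorithm can look like, here is a minimal generic sketch of privately selecting the top K SNPs by repeated use of the exponential mechanism. It is not the specific algorithm of any paper cited above. The function name, the SNP identifiers, and the scores are hypothetical, and the sensitivity of the association score (how much one participant can change it) is taken as a given input, although deriving it for a real test statistic is a substantial part of the work.

```python
import math
import random


def dp_top_k(scores, k, epsilon, sensitivity, rng=None):
    """Select k SNP ids with total privacy budget epsilon.

    scores      : dict mapping SNP id -> association score
    sensitivity : assumed bound on how much one participant can change a score
    Each of the k rounds runs the exponential mechanism with budget epsilon / k,
    so by sequential composition the whole selection uses budget epsilon.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of SNPs")
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    remaining = dict(scores)
    eps_round = epsilon / k
    chosen = []
    for _ in range(k):
        ids = list(remaining)
        top = max(remaining.values())
        # Subtract the maximum before exponentiating for numerical stability.
        weights = [
            math.exp(eps_round * (remaining[i] - top) / (2.0 * sensitivity))
            for i in ids
        ]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical scores for illustration only.
toy_scores = {"rs1": 25.0, "rs2": 3.1, "rs3": 30.2, "rs4": 0.4}
print(dp_top_k(toy_scores, k=2, epsilon=1.0, sensitivity=1.0))
```

With a small budget the selection is close to uniform and often misses the truly strongest SNPs; with a large budget it almost always returns them. Note that this sketch releases only the identities of the selected SNPs, not their scores, which would need additional noise and additional budget.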
Current research on differentially private algorithms is providing tools to share genomic data while controlling the level of privacy protection for genetic study participants. However, how to wield these tools properly in practice remains an open question. The main challenge in using differential privacy is balancing privacy against data utility. Intrinsic to differential privacy is a tuning parameter, commonly denoted ε, that controls the level of privacy protection. How this parameter affects data utility depends on the nature of the data and on the algorithm. Understanding the limits of the privacy-tuning parameter and how to choose it in a sensible way is one of the key hurdles to making differentially private algorithms useful in practice, in this data setting and in others.
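The trade-off is easiest to see with the Laplace mechanism applied to a single count. The sketch below releases a minor allele count with noise whose scale is the sensitivity divided by ε. It assumes a sensitivity of 2, on the grounds that one diploid participant contributes at most two copies of an allele; the function names and numbers are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, sensitivity=2.0, rng=None):
    """Release a count under the Laplace mechanism with privacy parameter epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def expected_abs_error(epsilon, sensitivity=2.0):
    """Expected absolute error of the release, equal to the Laplace scale."""
    return sensitivity / epsilon


# Smaller epsilon means stronger privacy and a noisier released count.
for eps in (0.1, 0.5, 1.0, 5.0):
    print(f"epsilon={eps}: expected |error| = {expected_abs_error(eps):.1f}")
```

For one count the relationship is simple: halving ε doubles the expected error. For GWAS outputs that involve many SNPs and more complex statistics, the budget has to be divided across everything released, which is why the choice of ε is so much harder in practice.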
To take full advantage of the large amount of genetic data collected, it is imperative that data are shared among researchers. Not only is the sharing of genetic data essential for forming larger data sets for analysis, but it also makes resource allocation more efficient by reducing the number of duplicate experiments, and it supports reproducibility and scientific discovery. NIH’s action of limiting access to aggregated human genomic data has spurred interest in the development of methods for confidentiality and privacy protection of GWAS databases. The most significant methods to date have arisen from interdisciplinary research that combines statistical notions of utility with algorithmic thinking and risk measures from computer science. Whether this is going to be the most useful framework remains to be seen. But the problem of more broadly sharing useful human genomic data for research purposes and clinical discovery while maintaining individuals’ privacy is not going away any time soon.
Further Reading
Clayton, D. 2010. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics 11(4):661–673. doi:10.1093/biostatistics/kxq035.
Couzin, J. 2008. Whole-genome data not anonymous, challenging assumptions. Science 321.5894:1278.
Dwork, C., and A. Smith. 2010. Differential privacy for statistics: what we know and what we want to learn. Journal of Privacy and Confidentiality 1(2).
Gymrek, M., A. McGuire, D. Golan, E. Halperin, and Y. Erlich. 2013. Identifying personal genomes by surname inference. Science 339.6117:321–4.
Johnson, A., and V. Shmatikov. 2013. Privacy-preserving data exploration in genome-wide association studies. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1079–1087.
Lumley, T., and K. Rice. 2010. Potential for revealing individual-level information in genome-wide association studies. JAMA 303.7:659–660.
Uhler, C., A. Slavkovic, and S. Fienberg. 2013. Privacy-preserving data sharing for genome-wide association studies. Journal of Privacy and Confidentiality 5.1:137–166.
Welter, D., J. MacArthur, J. Morales, et al. 2014. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research 42(Database issue):D1001–D1006. doi:10.1093/nar/gkt1229.
Yu, F., S. Fienberg, A. Slavkovic, and C. Uhler. 2014. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics 50C:133–141.
Zhou, X., B. Peng, Y. F. Li, Y. Chen, H. Tang, and X. Wang. 2011. To release or not to release: evaluating information leaks in aggregate human-genome data. ESORICS. Springer, 607–627.
About the Authors
Aleksandra Slavkovic earned her PhD from Carnegie Mellon University. She is an associate professor of statistics, with appointments in the department of statistics and Institute for CyberScience at Penn State University and department of public health sciences at Penn State College of Medicine. She serves as an associate editor of the Annals of Applied Statistics, Journal of Privacy and Confidentiality, and Journal of Statistical Computation and Simulation. Her primary research interest is in data privacy and confidentiality. Other research interests include evaluation methods for human performance in virtual environments, statistical data mining, application of statistics to social sciences, algebraic statistics, and causal inference.
Fei Yu received his PhD in statistics from Carnegie Mellon University in 2015. He is now a member of the technical staff at Bell Labs. His dissertation research is on scalable privacy-preserving data-sharing methodologies for genome-wide association studies.
O Privacy, Where Art Thou? takes a statistical look at data privacy and confidentiality. If you are interested in submitting an article, please contact Aleksandra Slavkovic, column editor, at firstname.lastname@example.org.