Genomics and Privacy

Aleksandra Slavkovic and Fei Yu

Genomics is not becoming the next significant challenge for privacy; it already is that challenge, and how we address it will have a global impact.

Agencies around the world have been putting considerable effort into collecting clinical and genomic data and building large databases to support a variety of personalized health-care initiatives. One of the most important of these is the Database of Genotypes and Phenotypes (dbGaP) at the U.S. National Library of Medicine.

Genome-wide association studies (GWAS), in particular, benefit greatly from rapid developments in high-quality genomic data collection. According to a report by the National Human Genome Research Institute, the total number of genome studies published rose from fewer than 50 in 2005 to around 900 in 2010 and almost 2,000 in 2013.

The increased collection and sharing of sensitive genomic information has raised significant societal issues, including concerns over individual privacy and confidentiality. In 2008, a paper by Nils Homer and a team of fellow researchers raised concerns that individuals can be easily identified from their genomic data, even when that data is disseminated in an aggregated form. In response, genomic data-curating agencies including the National Institutes of Health (NIH), the Wellcome Trust, and the Broad Institute not only removed genomic summaries of case and control cohorts from public access, but also instituted stricter policies on granting access to genomic summaries; J. Couzin discusses these issues in a 2008 Science article. The NIH policy remains in effect today.

Genomic Data Privacy Risks

Tightened genomic data access policies set in motion two contradictory movements in the genetics research community: (1) those studying potential privacy breaches associated with controlled release of genomic data; and (2) those advocating full access to personal genomic data. While it is clear that individual-level genetic data deserve a high level of protection, for many years researchers believed that releasing aggregated statistics from thousands of individuals in a GWAS would not compromise the genetic-study participants’ privacy. That belief was challenged in 2008, when Homer and his fellow researchers found that one can combine minor allele frequencies published in a GWAS with genetic data from publicly available sources, such as single-nucleotide polymorphism data from the HapMap project, to determine whether an individual has participated in that particular GWAS.
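
The idea behind such a membership test can be sketched in a few lines. The snippet below is a simplified illustration loosely modeled on the distance statistic described in the 2008 paper, not a reproduction of the authors' method; the function name, argument names, and toy frequencies are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def membership_statistic(genotype, pool_freq, reference_freq):
    """Return a t-like score for whether a person resembles the study pool.

    genotype       -- the person's allele frequency per SNP (0.0, 0.5 or 1.0)
    pool_freq      -- published allele frequency of the study cohort per SNP
    reference_freq -- allele frequency in a public reference panel per SNP

    Needs at least two SNPs whose per-SNP distances are not all identical.
    """
    if not (len(genotype) == len(pool_freq) == len(reference_freq)):
        raise ValueError("all inputs must cover the same SNPs")
    # Per-SNP distance: positive when the person is closer to the study pool
    # than to the reference population.
    d = [abs(y - r) - abs(y - m)
         for y, m, r in zip(genotype, pool_freq, reference_freq)]
    # One-sample t-like statistic of the distances against a mean of zero.
    return mean(d) / (stdev(d) / sqrt(len(d)))


# Toy data (made up): this person sits closer to the pool at every SNP,
# so the score comes out positive.
person = [1.0, 0.5, 0.0, 1.0]
pool = [0.9, 0.5, 0.1, 0.8]
reference = [0.5, 0.4, 0.4, 0.5]
print(membership_statistic(person, pool, reference))
```

A large positive score suggests the person resembles the study pool more than the reference population. With the thousands of SNPs available in a real GWAS, even small per-SNP differences can add up to a confident call, which is why aggregate frequencies alone can leak membership.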

Many publications have been devoted to identifying other potential privacy breaches since then, showing how genomic information stored in aggregate form can be exploited through minor allele frequencies. For example, D. Clayton proposed a Bayesian approach to testing the membership status of an individual in a particular sample, and Thomas Lumley and Kenneth Rice used regression results to predict a study participant’s disease status. Some researchers harness genomic summaries to impute and recover individual-level data. We have also seen breaches that take advantage of the linkage between genomic data and metadata. For an overview of these and other strategies researchers have discovered for determining the identities of individuals in aggregate genomic data, see the article in Further Reading by F. Yu et al. in the Journal of Biomedical Informatics. On the other hand, some researchers, such as those associated with the Personal Genome Project, advocate full openness to personal genomic data, accepting that privacy in this setting may not be feasible.

In July 2013, more than 70 leading medical and research organizations from around the world, including the NIH and the Wellcome Trust, declared their intent to form a global alliance to build a framework for sharing the genomic and clinical data they collect from genomic study participants. It is therefore important and timely to understand the underlying privacy and confidentiality risks of genomic data sharing, as well as the methods in statistics and computer science that could enable sharing of usable genomic data while minimizing disclosure risk.

Sharing Your Data

Privacy research is inherently interdisciplinary, and it also involves a number of ethical and ownership issues. People who agree to donate their genetic data not only disclose private information about themselves, but also put at risk the privacy of their children and grandchildren.

23andMe is a genetic-testing company known for offering relatively inexpensive genetic-testing kits. You can buy one of its kits for $99 and have your DNA analyzed. The catch, of course, is that the company gets to keep your genetic data. 23andMe wants to use all this data for medical research, but one can imagine the issues if the data gets sold to third parties such as insurance companies, which can then use this information to sell or deny products. 23andMe’s privacy policy indicates that this is not an acceptable use of its data, but the company does not deny that it will share aggregate information. However, we have already seen that sharing aggregate information does not guarantee privacy protection, and we need to be very careful about sharing such sensitive data.

Privacy Protection and Data Sharing

Researchers have started thinking about how to protect privacy and prevent linkage attacks on genomic data. One of the main difficulties of privacy protection is that it is almost impossible to control the auxiliary information available to an attacker. Many attacks on genomic data rely on strong correlations between released databases and publicly available data sets, and on the sparseness and high dimensionality of genomic data. While such correlations are often explicitly masked thanks to the regulations that genomic data-curating agencies follow (e.g., the HIPAA Privacy Rule), they can still be revealed to varying degrees by auxiliary information.

Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to genomic databases, especially in terms of guarantees regarding protection from linkage to external information. For example, in a recent study in Science, researchers were able to identify surnames of participants by linking their genetic data with recreational genetic genealogy databases. Thus, simply removing identifying information is not enough, and we need more careful methods of protecting privacy; see this article for more details.

More recent research on privacy tries to take into account the possibility of unforeseeable availability of auxiliary information by being very precise about what kind of privacy guarantees can be offered. Out of various privacy protection approaches, differential privacy is quickly becoming a widely accepted model for privacy protection. Suppose that a person is considering participating in a study that sequences her genetic data, and worries that this data or the results of the study could somehow be used against her, for example to deny her insurance. Differential privacy tries to alleviate the concerns of such a user. Any analysis carried out using differential privacy is endowed with the guarantee that, if you decide to take part in the study, an intruder (e.g., an insurance company) will learn almost nothing more about you from the results than it could have learned had you not taken part.
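
For readers who want the formal statement behind this guarantee, the standard definition of ε-differential privacy can be written as follows. The symbols are notation introduced here and do not appear elsewhere in this column: M is the randomized analysis, D and D' are two data sets that differ only in one participant's record, S is any set of possible outputs, and ε is the privacy parameter.

```latex
\[
\Pr\bigl[\,M(D) \in S\,\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\,M(D') \in S\,\bigr]
\quad \text{for all such pairs } D, D' \text{ and all output sets } S .
\]
```

When ε is small, the factor e^ε is close to 1, so the distribution of released results is nearly the same whether or not any single person's record is included.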

Such a strong privacy guarantee is not easy to provide, of course, and it may come at a serious price in terms of data utility. Researchers are working on developing differentially private algorithms for genomic data sharing. For example, Uhler et al., Yu et al., and Johnson and Shmatikov (see Further Reading) were the first to propose differentially private algorithms to release the SNPs that are most strongly associated with a phenotype (e.g., a disease), which is a task commonly carried out in genome-wide association studies. There are recent extensions to releasing coefficients of penalized logistic regressions in this setting as well.
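
To give a feel for what such an algorithm can look like, here is a minimal sketch of one generic approach: pick the top-scoring SNPs one at a time after adding fresh Laplace noise to every score, splitting the privacy budget evenly across the picks. It is not the algorithm of any of the papers cited above. The function names, the `sensitivity` argument (the most one person's data can change any score), and the toy scores are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_top_k(scores, k, epsilon, sensitivity, rng=None):
    """Return the indices of k SNPs chosen by repeated noisy-max selection.

    Each of the k rounds spends epsilon / k of the budget, so the noise
    scale per round is 2 * sensitivity * k / epsilon (an assumed, fairly
    conservative choice). Only indices are released, not the scores.
    """
    rng = rng or random.Random()
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of SNPs")
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = 2.0 * sensitivity * k / epsilon
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        # Fresh noise for every remaining SNP in every round.
        best = max(remaining, key=lambda j: scores[j] + laplace_noise(scale, rng))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical association scores for six SNPs.
toy_scores = [1.2, 35.0, 3.4, 28.5, 0.7, 5.1]
print(private_top_k(toy_scores, k=2, epsilon=1.0, sensitivity=1.0,
                    rng=random.Random(0)))
```

With a generous budget the selection usually matches the true top SNPs; with a tight budget the noise can promote weakly associated SNPs, which is the utility cost mentioned above. Working out the sensitivity of a real association statistic is the hard part, and it is deliberately left out of this sketch.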

These algorithms have been applied to real human GWAS data sets and evaluated by analyzing the trade-off between privacy protection and statistical utility of the released data. They show promise for supporting broader sharing of genomic data with rigorous privacy guarantees.

Current research on differentially private algorithms is providing tools to share genomic data while controlling the level of privacy protection for genetic study participants. However, how to wield these tools properly in practice remains an open question. The main challenge of using differential privacy is balancing privacy against data utility. Intrinsic to differential privacy is a tuning parameter (commonly denoted ε, with smaller values giving stronger protection) that controls the level of privacy protection. The tuning parameter affects data utility in different ways depending on the nature of the data and the algorithm. Understanding the limits of the privacy-tuning parameter and how to choose it in a sensible way is one of the key hurdles for making differentially private algorithms useful in practice in this data setting and others.
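
The trade-off can be made concrete with the simplest differentially private tool, the Laplace mechanism, applied to a single count. The sketch below is illustrative only: the function names and the example count are hypothetical, and the query is assumed to have sensitivity 1 (one participant changes the count by at most one).

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    # The difference of two exponentials with mean `scale` is Laplace noise.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


def expected_abs_error(sensitivity, epsilon):
    # The mean absolute value of Laplace noise equals its scale.
    return sensitivity / epsilon


rng = random.Random(0)
true_count = 120  # hypothetical number of participants carrying a variant
for epsilon in (0.1, 1.0, 10.0):
    released = laplace_mechanism(true_count, 1.0, epsilon, rng)
    print(f"epsilon={epsilon}: released {released:.1f}, "
          f"expected |error| {expected_abs_error(1.0, epsilon):.1f}")
```

The expected error is inversely proportional to the tuning parameter: a tenfold increase in privacy protection costs a tenfold increase in noise for this query. For statistics with higher sensitivity, or for many statistics released at once, the cost grows accordingly.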

Summary

To take full advantage of the large amount of genetic data collected, it is imperative that data are shared among researchers. Not only is the sharing of genetic data essential for forming larger data sets for analysis, but it also makes resource allocation more efficient by reducing the number of duplicate experiments, and supports reproducibility and scientific discovery. NIH’s action of limiting access to aggregated human genomic data has spurred interest in the development of methods for confidentiality and privacy protection of GWAS databases. The most significant methods to date have arisen from interdisciplinary research that combines statistical notions of utility with algorithmic thinking and risk measures from computer science. Whether this is going to be the most useful framework remains to be seen. But the problem of more broadly sharing useful human-genomic data for research purposes and clinical discovery while maintaining individuals’ privacy is not going away any time soon.

About the Authors

Aleksandra Slavkovic earned her PhD from Carnegie Mellon University. She is an associate professor of statistics, with appointments in the department of statistics and Institute for CyberScience at Penn State University and department of public health sciences at Penn State College of Medicine. She serves as an associate editor of the Annals of Applied Statistics, Journal of Privacy and Confidentiality, and Journal of Statistical Computation and Simulation. Her primary research interest is in data privacy and confidentiality. Other research interests include evaluation methods for human performance in virtual environments, statistical data mining, application of statistics to social sciences, algebraic statistics, and causal inference.

Fei Yu received his PhD in statistics from Carnegie Mellon University in 2015. He is now a member of the technical staff at Bell Labs. His dissertation research is on scalable privacy-preserving data-sharing methodologies for genome-wide association studies.

O Privacy, Where Art Thou? takes a statistical look at data privacy and confidentiality. If you are interested in submitting an article, please contact Aleksandra Slavkovic, column editor, at sesa@stat.psu.edu.
