Calling All Statisticians for the Next Wave of Biomedical Big Data Discoveries
Our increasing ability to acquire, store, and analyze large volumes of heterogeneous biomedical data has the potential to revolutionize our society’s understanding of human health and disease. The field of data science—the science of extracting knowledge from data—provides tremendous opportunities to use the data that we are generating to address some of our most daunting yet fundamental biomedical questions, such as “How does the brain work?” Currently there is no unifying theory or model for how the brain works. Answering these questions will require teams of researchers with an appropriate blend of domain knowledge and computational and quantitative expertise.
Despite the promise of data science, we face a dearth of skilled researchers who can mine and extract meaningful information from the vast amount of heterogeneous biomedical data. The field needs creative statisticians who are facile with computational ideas and paradigms to apply their skill set to the challenges of biomedical Big Data science. Although statistical challenges are plentiful in all areas of biomedical science, in this article we highlight the particular challenges addressed by two large, complementary initiatives from the National Institutes of Health (NIH), Big Data to Knowledge (BD2K) and Brain Research through Advancing Innovative Neurotechnologies (BRAIN), and describe the opportunities available for statisticians to accelerate advances in these initiatives.
The BD2K and BRAIN Initiatives
Recent initiatives stimulated by President Obama and the NIH provide opportunities for scientists, including statisticians, who are able to turn Big Data, often described by the three Vs (volume, velocity, and variety), into knowledge. In the fall of 2014, the NIH announced new grant awards totaling $46 million for the BRAIN initiative and $32 million for the BD2K initiative. Both BD2K and BRAIN are bold, new NIH initiatives that complement the existing investments of NIH institutes and centers and other federal agencies that work toward the president’s most ambitious goals in science and technology.
The NIH BD2K initiative was established to facilitate the broad use of biomedical digital data assets by training researchers and supporting research that aims to develop the methods, software, and tools needed to analyze biomedical Big Data. In contrast, the BRAIN initiative is focused on revolutionizing our understanding of the human brain by investing in the development of transformative technologies and methods to map the circuits of the brain and to understand how these circuits create our unique cognitive and behavioral capabilities. These initiatives highlight the increasingly loud call to the statistics community to apply statistical approaches to the voluminous data acquisition efforts currently underway across neuroscience and many other biomedical, clinical, and behavioral science disciplines.
Management and analysis of the diverse types of data being generated in response to the BRAIN initiative require new tools and methods to accelerate discovery. Heterogeneous types of data from roughly 86 billion neurons and trillions of connections are being explored with a myriad of technologies. They will require thoughtful integration in order for us to understand how the human brain controls our behavior at the speed of thought.
The volume of complex information facing neuroscience is growing exponentially, because the sheer number of neurons and connections in the brain is now coupled with advancing neurotechnologies that lower the barriers to recording data. Neuroscience data will increasingly be acquired with greater spatial and temporal resolution, presenting a daunting challenge for data storage and processing. Truly understanding a circuit requires identifying and characterizing the component cells, defining their synaptic connections, observing their dynamic patterns of activity as the circuit functions in vivo during behavior, and perturbing these patterns to test their functional significance. While these steps are clear, they present immense computational and methodological tasks, and new tools and methods are required to carry them out. The individual steps inherently require statistical skills, such as an understanding of experimental design, statistical inference, and dimensionality reduction, along with the scientific rigor needed to enhance research reproducibility.
Like neuroscience, many other biomedical disciplines are faced with an explosion in data. In genomics, the steep decline in sequencing costs has led to a sharp increase in usage and resulting voluminous data. In the clinic, the proliferation of electronic medical records creates large amounts of unstructured text that can be mined for information. In behavioral science, observational data from social media, wearable sensors, and other technologies are increasingly available. In many fields, multiple types of data are being tasked to purposes unimagined at the time of collection. Data sets are serving new purposes both individually and by being integrated with other data, whether of the same or diverse types. Data discovery and integration present computational and statistical challenges, but once those challenges are overcome, the opportunities for scientific advances are tremendous. Data are digital research objects; other types of digital research objects include software, workflows, publications, and training materials. Biomedical science is increasingly becoming a digital research enterprise, and the value of such an enterprise will only be fully realized when relevant data can be found and reused by scientists. The BD2K initiative is the extramural funding component of the effort to achieve the NIH goal of fostering the digital research enterprise.
Both the BRAIN and BD2K initiatives aim to develop new methods to analyze complex data, and the BRAIN initiative also has a focus on neurotechnologies. Neurotechnologies, including those that lower the barriers to recording neural data at unprecedented scale, are needed along with novel approaches for analysis to expand our knowledge of normal and aberrant brain function. Compared to the BRAIN initiative, BD2K encompasses broader scientific and data science areas and de-emphasizes data collection and the development of technology to collect data. BD2K spans the data types and domains of NIH, including neuroscience, cancer, heart disease, infectious diseases, and behavioral science. BD2K also spans the data pipeline, from the infrastructure to share and find data and software, to the expansion of analytical methods for biomedical data. Both initiatives include dedicated efforts to train the biomedical workforce in data science.
BD2K complements the BRAIN initiative by fostering the development of the digital biomedical research enterprise. A key element of the digital enterprise is “The Commons,” a moniker indicative of its potential ubiquity and shared ownership by the research community, including statisticians. The Commons is a conceptual framework for a digital environment to allow efficient storage, sharing, and usage of research objects, which include data and software utilized for experimental research such as that planned for the BRAIN initiative. The Commons provides a platform for implementing requirements based on a directive from the U.S. Office of Science and Technology Policy requiring federal funding agencies to make plans to increase public access to research results generated with public dollars, including publications and data, as far as feasible and consistent with existing law and policy.
The sheer scale of the data acquired through NIH support, including through the BRAIN initiative—and the opportunities for data access and multi-modal integration through BD2K—will allow the rate of knowledge discovery to rapidly increase through creative and persistent re-use of data. The sharing of data and the development of data analysis tools are central to the BRAIN initiative. Neuroscience data such as broad-scale neuronal recordings coupled with behavioral observations do not lose their utility after a single analysis, presenting a great opportunity for statisticians efficient at data wrangling, visualization, modeling, and analysis; statisticians with these and other computational skills are a type of data scientist. As NIH Associate Director for Data Science Phil Bourne wrote in a recent blog post, “sharing implies finding, using, reusing and attributing.”
The Commons will make accessing data and integrating multiple diverse types of biomedical data easier. Doing so has obvious implications for the rate and cost of discovery in medicine, provided these data can be transformed into knowledge through careful analysis based on principled approaches that take uncertainty and bias into account. By making widely available data that need less “data janitor” work, the Commons will reduce the time investment required for statisticians to apply their skill set to biomedical and behavioral science. Although exploring biomedical data is already possible because many data sources are online, the Commons will make them easier to find and use. This availability of data opens the door to tackling exciting new statistical challenges throughout biomedical science. Examples include:
- Inference based on models that integrate multiple data sources such as genomics, multi-modal imaging, behavior, and environmental monitoring;
- Scalable algorithms implementing principled approaches for prediction and inference through joint minimization of statistical risk and computational run time;
- Selection of dimensionality reduction methods;
- Causal inference from complex data;
- Real-time and online processing and analysis of streaming data, as needed for an efficient brain-computer interface and other applications.
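As a small illustration of the last item in the list above, streaming analysis often relies on single-pass updates that never store the full data stream. The sketch below uses Welford's online algorithm for a running mean and variance. It is a generic textbook technique, not a method prescribed by BD2K or BRAIN, and the class name `RunningStats` and the sample readings are hypothetical.

```python
class RunningStats:
    """Single-pass (online) mean and variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean times the deviation from the new mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Unbiased sample variance; NaN when fewer than two samples were seen."""
        if self.n < 2:
            return float("nan")
        return self._m2 / (self.n - 1)


# Hypothetical stream of readings (illustrative values, not real data).
stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = RunningStats()
for sample in stream:
    stats.update(sample)  # constant memory and constant time per sample
```

Because each update touches only three numbers, the same pattern scales to long recordings where holding every sample in memory is impractical.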
Statistical methodology development and applications of statistics are funded by NIH institutes and centers, as well as by BD2K. For a list of related funding announcements and recently funded projects, see the NIH Biomedical Information Science and Technology Initiative and BD2K web pages.
Despite the interesting challenges, neuroscience and other biomedical applications present barriers for statisticians with little biology in their backgrounds. Expert domain knowledge is essential to guide development of statistical models, as well as to understand the theories and concepts that they represent, whether about the brain or a disease. Through collaboration with domain experts, breakthroughs in both statistics and biomedical science will follow. These collaborations, however, will only be successful if each side understands the language of the other.
Fortunately, the transition to fluency in the languages of biomedical science is facilitated by the plethora of media providing on-demand learning opportunities, including massive open online courses (MOOCs), modules, and introductory books. MOOCs and other open educational resources provide free, easily accessible educational materials aimed at a broad audience. Biomedical data science challenges, run by companies such as Kaggle, provide pre-wrangled data, allowing a quick start-up for exploration by data scientists. The challenge forums provide informal places to exchange ideas and learn from the experience of others. Finally, books and websites written for the general public help make expanding into a biomedical field easy and fun. For example, Oliver Sacks provides an accessible view into the fascinating world of neuroscience through his popular books such as The Man Who Mistook His Wife for a Hat. With the vocabulary, context, and overview gained from popular books, scientific textbooks and publications become more accessible. With background knowledge gained through open educational resources and other media, an emerging biomedical data scientist can more easily work with biomedical collaborators. Together, biomedical scientists and computational statisticians can interrogate data in order to tackle urgent biomedical problems.
Cogito ergo sum, “I think, therefore I am,” the philosophical proposition by René Descartes, has defined us in many ways since it was published in 1637. Perhaps this fascination with neuroscience helps to explain the analogy of the BRAIN initiative with the first manned moon landing, the achievement that consumed the nation in 1969. The goal to completely characterize and understand how we think, and to apply that knowledge to alleviate the staggering toll from diseases of the brain, undoubtedly will be a great achievement, but only one of many that will be made possible with new data acquisition and analysis tools. Understanding the human brain is just one example of the many advances that could be facilitated with data that are openly shared, accessible, findable, and annotated. An ecosystem that fosters the reuse of data, by combining like data or mashing up diverse data, will enable new developments and efficiencies yet to be imagined. These new advances will be realized through the hard work of a new generation of scientists, including statisticians, who are comfortable with biomedical science, who can manipulate Big Data, and who can interpret meaning through understanding the theory and principles behind the algorithms and the inherent uncertainty associated with the data, models, and conclusions.
This article is a call to action—for statisticians to spread beyond the mathematical roots of statistics to a budding computational future where Big Data challenges are tackled side by side with biomedical and behavioral scientists.
About the Authors
Thomas Radman’s grants portfolio at the National Institute on Drug Abuse includes computational neuroscience, big data analytics, and electrophysiology. His previous experience includes review of neurological and ophthalmic diagnostic and therapeutic medical devices at FDA and algorithm development at BrainScope, a machine learning company interested in the diagnosis of traumatic brain injury using EEG recordings.
Erica Rosemond is a program officer in the Division of Neuroscience & Basic Behavioral Science, in the Office of Research Training & Career Development at the National Institute of Mental Health. She earned a PhD in human biology, specializing in the neurosciences, at the University of Toronto in Canada. Apart from her portfolio of fellowships, career development awards, and training grants at the NIMH, she leads efforts in the BD2K and BRAIN initiatives related to training at the National Institutes of Health.
Michelle Dunn is senior adviser for Data Science Training, Diversity, and Outreach at the National Institutes of Health. As a statistician by training and a healthcare consumer by lack of luck, her interests all relate to the use of statistical thinking to improve the conduct of biomedical research.