Big Data Computing
On December 5, 2012, the Georgia Advanced Computing Research Center (GACRC) at the University of Georgia (UGA), together with Enterprise Information Technology Services (which handles IT at the university) and our Office of the Vice President for Research, held “Big Data Lunch and Learn,” with an emphasis on campus computing resources for handling Big Data. The workshop was well attended. More than 80 people signed up in advance, but I’d estimate that more than 100 actually came.
The statistics department was happily well represented in the audience—out of the 24 people in the department, 10 from all ranks and types of position came to the workshop, indicating interest in the topic is strong.
I say we were well represented in the audience, but, sadly, reflecting a theme I’ve mentioned here before, not at the front of the room, which was dominated by computer scientists and researchers from the business school. Which is not to claim that these groups don’t also have a real and pressing interest in Big Data, but merely to point out, yet again, that we run the danger of being overlooked if we don’t push ourselves more to the forefront. Indeed, that very issue was a topic of much conversation among those of us who attended the lunch and learn—perhaps a good side effect in and of itself.
The morning opened with an overview of our local computing resources and how UGA is gearing up to be able to handle the large data sets that many researchers now encounter in their work. We were told about a contract the university has with Amazon Web Services, which I’ll discuss below. Then, faculty colleagues from computer science and the business school presented their perspectives and experiences with Big Data—why we should care, how we can deal with the issue, how infrastructure and data analysis need to change, what tools are out there and what needs to be developed. In all, it was an interesting and informative couple of hours.
Here are some things I learned. Amazon Web Services (AWS) is an example of a new paradigm called “infrastructure as a service” (IaaS). That is, instead of physical racks, cables, mainframes, etc., the provider creates a virtual data center, and data storage, movement, and analysis can all be done in the cloud (i.e., virtually). Clients can use the system on an “as needed” basis, setting up multiple nodes/servers virtually, running the job, and closing down the system.
This approach has the advantage of being fast and potentially cheaper, as there is no need to have servers running all the time waiting to be used. On the other hand, users are responsible for their own systems administration. Amazon and other IaaS providers supply the infrastructure, but not necessarily the administrative support that goes with it.
Another concern with this model is data privacy; there are various federal policies (for example, the Family Educational Rights and Privacy Act, or FERPA; the Health Insurance Portability and Accountability Act, or HIPAA) that are in place to guarantee the confidentiality of certain types of data. A virtual computing environment would seem to make it harder to ensure that those safeguards are effective.
We also heard about the “three Vs” of Big Data computing: volume, velocity, and variety. (A recent article at ibm.com also mentions “veracity”—how much users trust the data—but this doesn’t seem like a Big Data problem per se.) Volume, of course, refers to how much data there are to analyze or process. As readers of this column know—or indeed, anyone who has been paying attention to the modern statistical landscape—a distinguishing characteristic of the state of data, and why everyone is suddenly talking about Big Data, is that there are masses of it, beyond the capacity of most of our machines. Some are generated almost automatically, as byproducts of our digital life (see my previous column); others come about from improved, advanced, or accelerated scientific processes—genetic microarrays, medical imaging, satellite imaging. Large quantities of data create challenges to traditional computing technology. Although storage is cheap (and growing ever cheaper) and data collection is also becoming easier and cheaper, analysis of massive databases cannot be handled easily (if at all) by conventional methods.
The second “V,” velocity, refers to the ways in which data will come into the system. Traditionally, data have been collected and moved through computers in what we can broadly term “batch mode.” A researcher might, for example, go into the field, record observations, and transcribe those as a data set in an Excel spreadsheet. Data are entered in chunks or batches, discretely. Nowadays, by contrast, we see a lot of “streaming”—functional magnetic resonance imaging data analyzed in “real time” as they are coming off the scanner; Internet traffic data constantly and automatically collected, stored, and possibly analyzed in real time or soon thereafter; images of far-away stars taken and sent at a fantastic rate back to computers on Earth.
Streaming refers to the way in which the data are collected (continuously, in a stream) and how they are processed or analyzed (roughly in real time or in actual real time). Data collected in a stream can still be analyzed in batch, but when the data are both processed and analyzed in a streaming fashion, there may be no need to store the entire collection, which might be a distinct advantage in some settings. In this model, data are passed through the system, analyzed, and discarded, making way for new pieces of information.
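To make the "analyze and discard" idea concrete, here is a minimal sketch in Python. It keeps a running mean and variance using Welford's update, a standard one-pass method that I am choosing for illustration; it was not presented at the workshop. The `sensor_stream` generator and its values are hypothetical stand-ins for a live feed.

```python
import math


class RunningStats:
    """Running mean and variance for a stream, using Welford's update."""

    def __init__(self):
        self.n = 0        # observations seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one observation into the summaries; x itself is not stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (divisor n - 1); undefined for fewer than 2 points.
        return self._m2 / (self.n - 1) if self.n > 1 else math.nan


def sensor_stream():
    # Hypothetical stand-in for a live feed such as scanner or traffic data.
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)  # each reading is used once and then dropped

print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the advantage described above. The price is that any quantity not tracked from the start cannot be recovered later.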
There is an interesting interplay, naturally, between velocity and volume. The Kepler spacecraft is essentially a large camera launched into space; Kepler’s photometer has an extremely high resolution of 95 megapixels. The camera is made of 42 charge coupled devices (CCDs), which are read out every six seconds and summarized on board every half hour. This process creates a quantity of data that it is not possible to store and transmit to Earth—the sheer volume means that transmittal is brought to a crawl. To circumvent this problem, the scientific research team had to make compromises, for instance focusing on just certain pixels of the image, reducing the volume to gain back some velocity. Interestingly enough, the original three and a half year mission (which would have ended in 2012) was extended to 2016 because of problems in processing and analyzing the large quantities of data Kepler was collecting.
A hallmark of modern data types is their variety. Whereas a traditional (also called “classical”) data set consists of some number of observations on some number of units, arrayed in an r × c table, where the value in cell (i, j) is the measurement of variable j (the column; c variables in total) on unit i (the row; r units in total), today’s data sets have an almost unlimited range of variety. Images, social network graphs, pages from the Web, movies, tweets, structured, unstructured, and semi-structured—data come in all forms and shapes. Many cannot be stored in spreadsheet arrays as in Excel, or in standard R data frames. We need to adapt storage (a computing issue) and analysis (a statistical issue) to these new modes of data. One can imagine furthermore that the variety of data types is going to continue to expand. As computer scientists and statisticians successfully meet the challenges of today’s unusual data types, the line for what constitutes “nonstandard” will shift again, opening the door for modes that we can’t even imagine.
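A small Python sketch may help show why such data resist the r × c layout. The records below are invented for illustration, and `is_rectangular` is a hypothetical helper, not a function from any library. It simply checks whether every record has the same variables and whether every value is a single number or string.

```python
import json

# A "classical" data set: every unit (row) has the same variables (columns).
table = [
    {"unit": 1, "height_cm": 170.2, "weight_kg": 65.0},
    {"unit": 2, "height_cm": 181.5, "weight_kg": 80.3},
]

# Hypothetical semi-structured records: nested fields, optional keys,
# and lists whose length varies from record to record.
documents = [
    json.loads('{"id": 1, "text": "hello", "tags": ["a", "b"], "user": {"name": "x"}}'),
    json.loads('{"id": 2, "image": {"width": 640, "height": 480}}'),
]


def is_rectangular(records):
    """True if all records share the same keys and hold only scalar values."""
    same_keys = len({frozenset(r) for r in records}) == 1
    all_scalar = all(
        not isinstance(value, (dict, list))
        for r in records
        for value in r.values()
    )
    return same_keys and all_scalar


print(is_rectangular(table))      # the classical table passes the check
print(is_rectangular(documents))  # the semi-structured records do not
```

Forcing the second collection into a table would mean either inventing many empty cells or throwing away the nesting, and either choice is an analytic decision rather than a mere storage detail.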
Looking to the ways in which data are changing, we can recognize limitations of the traditional approach to computing and the accompanying technology. First, in some ways, the old familiar technology was designed for efficient transactions and storage, not analysis. When data sets are small and structured, there is no need to devote much thought or attention to what is broadly called “analytics” (statistics, data mining, and so on); as long as the data are stored in an accessible format, they can be analyzed by appropriate statistical techniques.
Second, supercomputers are expensive, hard to program, and hard to maintain. So, while they may have been a reasonable stopgap solution, perhaps we need to rethink this model as well.
Finally, statistical and data mining procedures tend to be centralized, that is, they work under the assumption that all the data are on a single machine. Increasingly, however, data are generated in multiple places; for example, a study might involve measurements from observatories around the world. It doesn’t seem that much of a stretch to argue that the data should be analyzed where they are collected (i.e., in parallel). This saves much in the way of resources, since scientists would not have to expend a lot of time on shuttling data back and forth to a central computer.
These limitations themselves suggest possible trends in computing for Big Data. Instead of moving data to one supercomputer to be analyzed, we can expect to see more “distributed” computing, in which clusters of servers are used for a particular purpose in a local fashion, as in the AWS model. Along with this, of course, comes more parallel computing, which is typically managed through high-level programming languages and interfaces, such as Java. Data management also needs to be simple and efficient, due to the size and complexity of the data sets in question.
New technologies are already being developed to support this style of computing: Hadoop, which is a system for performing cluster-based parallel computing; HBase and Apache Cassandra, which are nonrelational distributed databases; MongoDB, an open-source document-oriented database system; and Pegasus, Giraph, and GPS, all of which are designed to process and mine graphs.
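The pattern of analyzing data where they sit and shipping only small summaries can be sketched in a few lines of Python. This is a toy under stated assumptions: the site names and observations are invented, and the "sites" are processed one after another on a single machine, whereas in a real cluster each local step would run on its own node.

```python
from collections import Counter
from functools import reduce

# Hypothetical observations held at three separate sites.
site_data = {
    "site_a": ["galaxy", "star", "star", "quasar"],
    "site_b": ["star", "galaxy"],
    "site_c": ["quasar", "star", "star"],
}


def summarize_locally(observations):
    """The "map" step: runs where the data live and returns a small summary."""
    return Counter(observations)


def merge(left, right):
    """The "reduce" step: combine two local summaries into one."""
    return left + right


# Only the summaries travel to the coordinator, never the raw observations.
local_summaries = [summarize_locally(obs) for obs in site_data.values()]
combined = reduce(merge, local_summaries, Counter())

print(combined.most_common())
```

Because the merge is just addition, the order in which sites report does not matter, and a late or repeated report from one site can be handled without touching the others.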
If you explore the websites of many of these systems, you will see they tout themselves as being “fault tolerant,” decentralized, and scalable. The first of these means that data are replicated across multiple nodes or centers, so that if one fails, analysis can continue without down time. The second, as we’ve already seen, means there is not one single computer to which all data are transferred and on which they are analyzed. The third means that as data sets get bigger and even more complicated, the computing environment is able to adapt.
One of the presenters at the workshop summarized what he saw as the key ideas for the future as follows:
- We will see more analysis of globally distributed data on globally distributed clusters.
- Instead of exact analysis on the entire data set, we will see “approximate analytics”—analyses performed on parts of the data, or on a condensed version of the data, for example.
- Scalability of both data collection and data analysis will continue to be a crucial issue.
The first point I see mostly as one of hardware and technology—an engineering and computer science issue, not so much a statistical one. The other two, however, are quite clearly in our bailiwick as statisticians. Some of our colleagues are already working hard on developing methods for the analysis of large, possibly nonstandard, data sets that have been condensed in some way. I’m thinking here of the new fields of symbolic data and object-oriented data. More is needed though, leading us back again, intriguingly, to classical notions such as sufficiency and ancillarity of data summaries—what aspects of the data contain the relevant information about the parameters of interest (the sufficient statistics) and which do not (the ancillaries)? Even if we have to be satisfied with approximate solutions, we can surely assess the quality of those approximations by appeals to sufficiency. Some approximations, in other words, will be better than others will, and we want to keep our focus on those.
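As a reminder of how much a well-chosen summary can preserve, consider simple linear regression. The sketch below, in Python, condenses each chunk of data to five sums and recovers the least squares intercept and slope from those sums alone. The class name and the data are hypothetical, and the textbook sum formulas used here can lose accuracy for badly scaled data, so treat this as an illustration rather than production code.

```python
from dataclasses import dataclass


@dataclass
class RegressionSummary:
    """Sums that determine the least squares fit of y = a + b * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    @classmethod
    def from_chunk(cls, xs, ys):
        # Condense one chunk of raw data to five numbers.
        return cls(
            n=len(xs),
            sx=sum(xs),
            sy=sum(ys),
            sxx=sum(x * x for x in xs),
            sxy=sum(x * y for x, y in zip(xs, ys)),
        )

    def __add__(self, other):
        # Summaries from different chunks combine by simple addition.
        return RegressionSummary(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )

    def fit(self):
        """Return (intercept, slope) computed from the summaries alone."""
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


# Hypothetical data, split across two chunks (or two machines).
chunk1 = ([0.0, 1.0, 2.0], [1.1, 2.9, 5.2])
chunk2 = ([3.0, 4.0, 5.0], [6.8, 9.1, 11.0])

combined = RegressionSummary.from_chunk(*chunk1) + RegressionSummary.from_chunk(*chunk2)
full = RegressionSummary.from_chunk(chunk1[0] + chunk2[0], chunk1[1] + chunk2[1])

print(combined.fit())
print(full.fit())
```

The two fits agree up to floating-point rounding, so nothing about the coefficients is lost by discarding the raw data. Adding the sum of squared responses to the summary would also allow the residual variance to be recovered. For most models no such tidy reduction exists, which is exactly where the statistical work lies.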
As for scalability, particularly of data analysis, here too we have an important role to play. We are equipped with knowledge of asymptotics—the large sample behavior of various data analysis techniques—as well as robustness and breakdown points of those techniques—what sort of data anomalies cause them to fail. It’s been discussed in our literature, especially those corners that deal with bioinformatics and imaging, but elsewhere as well, that traditional methods such as regression and analysis of variance don’t necessarily scale up when the number of variables is large. This is particularly a problem when the number of variables is larger than the number of observations, the so-called “large p, small n” scenario. But there are other instances, as well, no doubt. We can—and should—be active in discovering these, in exploring the limitations of existing data analysis methods when applied to large and unstructured data sets, and in suggesting alternative approaches that are more suitable to the current data landscape.
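The “large p, small n” difficulty can be seen directly in the normal equations for least squares, which require inverting the p × p matrix X'X. Since the rank of X'X cannot exceed n, that matrix is singular whenever p is larger than n. The Python sketch below checks this on a small invented design matrix; `matrix_rank` and `gram` are hypothetical helper names written for this example, using exact fractions to avoid rounding questions.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Look for a nonzero pivot at or below the current pivot row.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


def gram(x):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


# Hypothetical design matrix: n = 3 observations on p = 5 variables.
X = [
    [1, 2, 0, 4, 1],
    [0, 1, 3, 1, 2],
    [2, 0, 1, 1, 5],
]
n, p = len(X), len(X[0])
xtx = gram(X)

print("n =", n, "p =", p, "rank of X'X =", matrix_rank(xtx))
```

With the rank of X'X below p, infinitely many coefficient vectors fit the data equally well, so ordinary least squares has no unique solution. Some extra structure, such as a penalty or a sparsity assumption, is needed to single one out.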
A final theme, discussed near the end of the workshop, was more philosophical than the nuts-and-bolts technology of Big Data computing: a potential paradigm shift in science. Whereas the traditional approach, even in Big Data analysis, has been deductive, the speaker suggested that science might be moving to a more inductive paradigm. Deductive reasoning involves all the standard machinery of hypothesis testing, in which hypotheses are formed and then tested against the data using formal (statistical) rules. By contrast, with the inductive approach to science, the main tools are related to pattern identification and human interpretation of data observed in nature. Visualization is a key part of this process, as researchers elicit theories based on patterns they have identified in the data.
I don’t know if this last speaker was right in stating that science is moving (or needs to move?) in a new direction, nor were some of my colleagues and I convinced that the “new paradigm” is truly novel. It seemed to us that many statisticians already work in the so-called inductive way: exploratory data analysis (advocated by Tukey) and initial data analysis (advocated by Chatfield) have been around for decades, as have clustering and classification techniques and other informal data-driven methods for generating hypotheses. Visualizing the data and interpreting the patterns are also an intrinsic part of more modern approaches such as functional data analysis. Perhaps this perspective was a side effect of not having any statisticians involved in the planning of the workshop. Even so, it was disheartening to see how little some of our colleagues on campus know about what we do and how we do it. I think we have good contributions to make to the database management and computing aspects of Big Data, coming from our experience with thinking about data and how to manipulate them.
On the practical side, although it might be painful and involve a steep learning curve, I think we need to gain familiarity with parallel and cloud computing. We need to become comfortable with programming in Python, Perl, etc. Statisticians use big, complex, unstructured data sets and we should be at home in this new computing environment. It might be “too late” for those of us who are already overly busy with research, mentoring, and other professional obligations, but surely we owe it to our students to give them exposure to the new computing (not necessarily scientific!) world.
There is, without a doubt, a shift that needs to happen—and is happening—in how we manage, store, manipulate, and analyze data, whether or not this translates into a true paradigm shift in the Kuhnian sense. The statistician of the future will therefore most likely have a very different relationship to computing and computers than do we who are active today. We can begin now to prepare our students for that future, as many of them will spend their careers in this new and different environment. So, there is an opportunity (an obligation?) here for curricular reform, certainly at the graduate level, in how we teach statistical computing if nothing else, but maybe at the undergraduate level as well.
In recent years, I’ve been teaching an undergraduate capstone course aimed at our majors. This is a project course, in which teams of students work to analyze data sets from UGA scientists. Often, the projects involve data sets that are larger and more unstructured than anything the students have ever confronted, and whatever their strengths, SAS and Minitab simply don’t stand up to the challenge! Some of the projects might lend themselves to simple parallelization, and this could be a start, an easing into the world of Big Data computing for these students. There are some interesting avenues to explore here as well, such as how best to equip our students for a still-uncertain and ever-changing work environment. I don’t have any answers, but leave the question for you to ponder. If you have thoughts on this, or stories of your own experiences, please feel free to pass them along.
In The Big Picture, Nicole Lazar discusses the many facets of handling, analyzing, and making sense of complex and large data sets. If you have questions or comments about the column, please contact Lazar at firstname.lastname@example.org.