Big Data and Privacy


When thinking about the topic of this issue’s column, I batted around various ideas before settling on the theme of privacy in the age of Big Data. I knew I was on to something when, a few days later, my copy of Significance appeared in my mailbox and on the cover was the same theme!

I actually think about the questions of privacy and anonymity a fair bit because I am by nature a private person and don’t like to share my information—online or otherwise—unnecessarily. At the same time, I do have an online presence as an academic. Going “off the grid” does not strike me as realistic or desirable.

Where is the balance? I’ll admit that at times I might go too far in the direction of discretion. An anecdote: When I was in my first assistant professor job, living in Pittsburgh, a local grocery chain had a loyalty card. When you bought groceries, you would swipe the card, and in return the store would print coupons that matched your shopping patterns and would give you discounts from time to time, etc. The usual things. One of my senior colleagues told me (with a gleam in his eye that only a true lover of data would have at such a moment) that the grocery chain had approached him about analyzing the masses of data that would be accumulating each day.

As a statistician and lover of data myself, I could see why he was intrigued. This was a rather early example of data mining: looking for patterns in shoppers’ baskets of groceries. What’s the likelihood that a customer will buy bread, milk, and eggs on the same trip? Do shoppers who buy beer tend to buy chips as well? You can think of your own examples. Again, this is all pretty common now, but at the time, for me at least, it seemed to open up new vistas of data analysis—and it was very exciting.
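To make that kind of basket mining concrete, here is a minimal sketch in Python that counts item co-occurrences and turns them into the usual support and confidence measures. The baskets and item names below are invented for illustration; a real analysis would run an association-rule algorithm such as Apriori over millions of transactions, but the quantities of interest are the same.

```python
from collections import Counter
from itertools import combinations

# Invented baskets; real input would be loyalty-card transaction logs.
baskets = [
    {"bread", "milk", "eggs"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "chips", "salsa"},
    {"milk", "eggs"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n_baskets = len(baskets)
for (a, b), count in pair_counts.most_common():
    support = count / n_baskets           # how often a and b appear together
    confidence = count / item_counts[a]   # how often b appears, given a was bought
    print(f"{a} & {b}: support={support:.2f}, confidence={confidence:.2f}")
```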

That’s how the statistician in me reacted. The other me, the individual, didn’t want one of those cards. I didn’t want the grocery chain to know what I bought, when I bought it, what my patterns were. It was none of their business! When my boyfriend, now husband, left Pittsburgh, he gave me his loyalty card. And I’ll admit, my first thought was “Oh, good, now I can mess with their data!” I was amused to think about my colleague or one of our graduate students mining this huge accumulation of data and seeing a blip or switch in the record for that card.

Of course, I knew that it didn’t really work that way, which brings me around to the bigger point—namely how much anonymity or privacy do we really have in a situation like this?

The question is very much in the air in early November as I write this column. A few days ago, a story broke about Harvard University photographing classrooms in an effort to monitor student attendance. Cameras in lecture halls took pictures every minute, and the numbers of empty and full seats were counted, presumably by some pattern-recognition software.

Harvard’s Institutional Review Board ruled that the practice didn’t fall under the purview of human-subject research. Faculty members were apparently informed about the study while students were kept in the dark. Had they known, students likely would have modified their behavior by coming to class more than usual. (Or, for those who like to mess with the data, maybe skipping class when they wouldn’t ordinarily!)

The study was conducted in spring 2014 and made public in the fall. Students were outraged, as were many faculty members, claiming that Harvard had invaded their privacy and that administrators should have been more forthcoming about the project. It didn’t help that this followed another scandal at Harvard, in which the administration had apparently been scanning emails of particular individuals. Although school officials destroyed the classroom pictures immediately, and were at pains to insist that individual students per se were not of interest, many students, faculty members, and observers condemned the practice, blasting school officials on comment boards at The Chronicle of Higher Education and The Harvard Crimson.

Others, though, wondered what the fuss was all about, pointing out that data (including video and photographic) are collected about us all the time, so what’s one more data mining experiment? Some also argued that there is no reasonable expectation of privacy in a college classroom, especially at a private institution that can, in many respects, set its own rules.

That this is one of the pertinent issues of the Big Data age is also evident from a recent report—“Big Data and Privacy: A Technological Perspective”—submitted in May 2014 to President Obama by the President’s Council of Advisors on Science and Technology (PCAST). Concerns about privacy and anonymity have long been coupled with changes in technology, dating back to the establishment of the U.S. Postal System in 1775, and on through the inventions of the telegraph, telephone, and portable cameras. The PCAST report surveyed the ways in which data collection, analysis, and use have converged in the modern age to make these issues particularly fraught.

In the past, it was easier—even feasible—for an individual to control what personal information was revealed in the public sphere. This is no longer the case. Public surveillance cameras and sensors record data, often without the individual being filmed or recorded even being aware that it’s happening.

Additionally, social media such as Facebook and Twitter expose a great deal of personal information—sometimes intentionally, sometimes not. The emergence of statistical techniques for the analysis of disparate data sources—a hallmark of Big Data—means that even if the information disclosed in one database preserves the individual’s privacy, combining it with other (also privacy-protecting) databases can make identification of the individual possible, and maybe inevitable in some cases. There are legal precedents governing some of these practices and concerns, but not all, and in any event they’ll all need to be revisited in light of the evolving technological landscape.

Part of the challenge inherent in the modern paradigm, as noted in the PCAST report, is that a given application may bring both benefit and harm, whether intended or not. For example, the Fourth Amendment to the U.S. Constitution restricts the government from searching private records in a home without probable cause. This seems straightforward until you think about what constitutes the boundaries of your home in a WiFi, cloud-enabled world. The same technology that allows you to put family photographs and documents into, say, the cloud to share with friends and relatives who live far away also blurs the definition of what constitutes your home. Do the same legal protections extend there? If not, how might your privacy be compromised?

That may be a question for lawyers. How about one for the statisticians? The PCAST report emphasizes that much of the benefit of Big Data comes from the data mining aspect—the ability to detect correlations that may be of interest or use. But it’s important to keep in mind that these are just correlations, which means that the discovered relationships do not hold in all generality, and in particular may not hold for certain—and possibly vulnerable—sub-populations. Harm in a medical context, for example, could arise from mistaking correlations for hard-and-fast rules. And as I’ve already noted, additional harm can come from merging data sets that were not intended to be merged, and subsequent analysis revealing facts about an individual that she never meant to make public.
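A toy example makes the sub-population caveat vivid. The counts below are invented, in the spirit of the classic Simpson’s paradox illustrations, and refer to no real study; they simply show how a relationship in the pooled data can reverse inside every subgroup.

```python
import pandas as pd

# Invented counts: within each severity group, treatment A has the higher
# recovery rate, yet pooled over severity, treatment B looks better.
data = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "treatment": ["A", "B", "A", "B"],
    "recovered": [81, 234, 192, 55],
    "patients":  [87, 270, 263, 80],
})

# Recovery rates within each sub-population
print(data.assign(rate=data.recovered / data.patients))

# Recovery rates in the pooled data
pooled = data.groupby("treatment")[["recovered", "patients"]].sum()
pooled["rate"] = pooled.recovered / pooled.patients
print(pooled)
```

Acting on the pooled correlation alone would favor the treatment that is worse for both kinds of patients, which is exactly the kind of harm the report warns about.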

The PCAST report also distinguishes between two types of data especially relevant for the privacy discussion: “born digital” and “born analog.” Data that are born digital, as the name implies, are created specifically for use by a data-processing system. Examples include email and short message services, data entered into a computer or cell phone, location data from a GPS, “cookies” that track visits to websites, and metadata from phone calls. In all of these cases, there is intent, at some level, to provide the data to the monitoring system. Privacy concerns here stem from two main sources—over-collection of data and fusion of data sources. Data over-collection occurs when the system, intentionally or otherwise, collects more information than the user was aware of providing. An example from the PCAST report is an app called Brightest Flashlight Free, which millions of people have downloaded. Every time it was used, the app sent details of its location to the vendor. Why is this information necessary for a flashlight? Clearly, it isn’t, and hence this is an instance of over-collection. The violation of privacy is obvious: someone who downloads a flashlight app for his phone is not expecting to reveal data on his location every time he uses it. To make matters worse, the location information was apparently also sold to advertisers.

Data fusion is the term used when data from various sources are combined and analyzed together using data mining or other Big Data techniques. Even if each source on its own provides adequate privacy protection, when multiple sources are analyzed together they may draw a picture that is detailed enough to reveal specific and confidential information at the individual level. The challenge with data fusion is to devise analysis approaches that preserve the rights of the individual to control what she reveals about herself to the world at large. This is an area of current research in statistics.
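Here is a toy sketch of the fusion problem, assuming two entirely hypothetical tables: an “anonymized” medical file with names removed, and a public roster that happens to share the same quasi-identifiers (ZIP code, birth year, sex). Neither table alone links a name to a diagnosis; a one-line join does.

```python
import pandas as pd

# Hypothetical "anonymized" medical file: no names, but quasi-identifiers remain.
medical = pd.DataFrame({
    "zip":        ["30602", "30605", "30606"],
    "birth_year": [1975, 1982, 1990],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["diabetes", "asthma", "hypertension"],
})

# Hypothetical public roster: names plus the same quasi-identifiers.
roster = pd.DataFrame({
    "name":       ["A. Smith", "B. Jones", "C. Doe"],
    "zip":        ["30602", "30605", "30606"],
    "birth_year": [1975, 1982, 1990],
    "sex":        ["F", "M", "F"],
})

# Fusing the two sources re-attaches names to diagnoses,
# even though neither table alone reveals that link.
linked = medical.merge(roster, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
```

Linkages of this sort are what motivate formal protections such as k-anonymity and differential privacy, part of the research activity mentioned above.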

In contrast to data that are born digital, born analog data originate in the physical world and are created when some features of the physical world are detected by a sensor of some sort and converted to digital form. Examples include health statistics collected by a Fitbit, infrared video imaging, cameras in video games that interpret hand or other gestures by the player, and microphones in meeting rooms. When a camera takes pictures of a busy downtown neighborhood, it is bound to pick up some signal that is not of immediate interest. Hence born analog data will frequently contain more information than intended. Again, this can result in benefit as well as in harm, depending on how the data are used. Once born analog data are digitized, they can be fused with born digital data, just like any other source, and analyzed together as well.

Given this new state of reality, is it even possible to protect one’s privacy? The answer to me seems to be a cautious yes. There is something of an arms race in which people work to crack protections, which in turn spurs the development of more sophisticated protective measures, which then pose a challenge to the first group, and on and on. Encryption is one way to make data more secure, although codes can be broken or stolen. “Notice and consent” is most often used for the protection of privacy in commercial settings. We all have encountered these when we want to install new software or a new app and are supposed to read and agree to a long list of terms before proceeding with the download. Among those terms are items relating to how your data may be used—for example, sold to advertisers or other third parties. But how often do you read those notices through and through? Always? Sometimes? Never? For most people it’s probably the last option, and this is a problem because it shifts the responsibility from the entity collecting your data back to you. The practice is even more problematic because not everyone is a lawyer, and so even if we take the time to read the terms, many of us don’t understand the implications and can’t object if some of the terms seem unreasonable. In addition, the provider can change the privacy terms down the line without informing you of that fact.

Since notice and consent is the most prevalent model, and given the problems I just described, the PCAST report recommends that major effort be put into devising more workable and effective protections at this level, in part by placing the responsibility for using your data according to your wishes back onto the providers—those collecting the data. One proposed framework is to have a variety of privacy protocols that emphasize different utilities. You would sign on to such a protocol, which would be passed on to app stores and the like when you use their services. The protocol you chose would dictate how your data could be used, shared, disseminated, etc. Or, the groups offering the protocols could vet new programs and apps to ensure that they meet the desired standards. In either case, consumers would no longer need to wade through screens of legalese (or ignore them altogether!); the burden would be on the other parties to the transaction.
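Purely as a thought experiment (the report does not specify any format), here is a minimal Python sketch of what a machine-readable privacy protocol and a vetting check might look like. Every field name and rule below is an invention of mine, not a PCAST proposal.

```python
# A made-up, machine-readable privacy "protocol" a consumer might choose once,
# instead of wading through every app's terms.
my_protocol = {
    "share_with_third_parties": False,
    "targeted_ads": False,
    "location_collection": "while_in_use",   # vs. "always" or "never"
    "max_retention_days": 30,
}

def app_conforms(practices: dict, protocol: dict) -> bool:
    """Vet an app's declared data practices against the chosen protocol."""
    if practices.get("share_with_third_parties", False) and not protocol["share_with_third_parties"]:
        return False
    if practices.get("targeted_ads", False) and not protocol["targeted_ads"]:
        return False
    if practices.get("location_collection", "never") == "always" and protocol["location_collection"] != "always":
        return False
    return practices.get("retention_days", 0) <= protocol["max_retention_days"]

# The flashlight app from the over-collection example would fail the check.
flashlight = {"share_with_third_parties": True, "location_collection": "always", "retention_days": 365}
print(app_conforms(flashlight, my_protocol))   # False
```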

There is another perspective I should add, and that is symbolic data analysis, on which I have written here in the past. Recently my colleague Lynne Billard gave a talk in our weekly colloquium series about this approach to data analysis. She mentioned that one of the ways in which a statistician may receive a data set for which symbolic methods are appropriate is through aggregation. Her motivating example was from a medical data set, where the insurance company almost certainly doesn’t care about my visits to the doctor; it cares about visits of “people like me.” This provides a type of built-in anonymity for these massive, automatically generated data sets if they are indeed analyzed in an aggregated fashion.

Of course, this raises other interesting and pertinent statistical questions: What does it mean to be “like me” for the purposes of this type of analysis? Who defines these categories? How sensitive are the analysis and its results to the particular aggregation scheme? Presumably the questions that the insurance company is asking should guide the aggregation. In some cases, maybe my age is relevant; in others, it may be my height, weight, cholesterol levels, blood pressure, and so on. I can think of many dimensions along which someone may be “like me.” Some will be important for certain types of analysis and not for others, but in any case, “I as me” will perhaps not be so critical.
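As a rough sketch of what such aggregation might look like, the snippet below groups some invented claims records into age-band-by-sex cells and summarizes doctor visits at the group level. The variables and bands are assumptions made purely for illustration; changing the bands is precisely the aggregation-scheme sensitivity raised above.

```python
import pandas as pd

# Invented individual-level claims records.
visits = pd.DataFrame({
    "age":           [34, 37, 41, 44, 58, 61],
    "sex":           ["F", "F", "M", "F", "M", "M"],
    "doctor_visits": [2, 4, 1, 3, 6, 5],
})

# One possible aggregation scheme: ten-year age bands, crossed with sex.
visits["age_band"] = pd.cut(visits["age"], bins=[30, 40, 50, 60, 70])

# The analysis then sees only group summaries ("people like me"),
# not any single person's record.
summary = (visits.groupby(["age_band", "sex"], observed=True)["doctor_visits"]
                 .agg(["count", "mean"]))
print(summary)
```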

These questions of data collection, analysis, and usage, and how they intersect with personal rights to (or desire for) anonymity and privacy, are not going to disappear. They are part of our cultural and technological landscape and are likely to expand over time. I think it’s important for us to think about these issues and to decide where our personal lines are, how much we are comfortable sharing about ourselves, and what sort of presence we want to have in the digital and other modern realms.

So, to all of my students, former students, collaborators past and present, and old friends who try to connect via LinkedIn, Facebook, ResearchGate, and the like—when I don’t respond, just know that I don’t participate in any of those fora. It’s my small way of keeping a corner of privacy in the world.

Further Reading

President’s Council of Advisors on Science and Technology. May 2014. “Big Data and Privacy: A Technological Perspective.”


In The Big Picture, Nicole Lazar discusses the many facets of handling, analyzing, and making sense of complex and large data sets. If you have questions or comments about the column, please contact Lazar at nlazar@stat.uga.edu.
