The Odds of Justice: Dangerous Models?
“All models are wrong, but some models are useful” is the familiar George Box refrain. But for what are they used? Statistics has enjoyed (or suffered) a lot of attention during the COVID-19 pandemic. Maps, graphs, counts—all are perused by millions who listen to experts and politicians and then argue about “When will it peak?” or “Herd immunity versus lockdown?”
Statisticians are the first to ask for more data, but sometimes the last to ask what is being done with their analyses once the data have been acquired. Data scientists are working frantically to develop algorithms that will provide a better means of singling out groups and individuals as potential victims or those who have recovered. But then how will that information be used? Identification of individual characteristics for beneficial treatment is positive; identification for potential harm is not. And what might the potential harm be?
Statistics are, we know, used for all sorts of good purposes; there are excellent reasons for knowing as much as we can about the world in which we live and the people in it. In one of his many television interviews during the pandemic, geneticist Francis Collins, MD, director of the National Institutes of Health, described how much was learned from the Framingham study about heart disease and other conditions. Statisticians’ work in clinical trials has brought hope to millions. The harmful effects of smoking cigarettes, long nicknamed “coffin nails,” were suspected for years, but it was the statistical evidence of the risks involved that brought about the broad abandonment of the habit. Statisticians have documented damage to the environment, and statistics helped to convince the U.S. Supreme Court that “separate” was not “equal” when it comes to education (Brown v. Board of Education, 347 U.S. 483 (1954))—and statistics show that it still isn’t today.
But once the data are there, we know that, with or without the assistance of statisticians, they may be used for all sorts of nefarious purposes. As Seltzer and Anderson (among others) have shown, the dangers of recording religion, citizenship, ethnicity, and health data, and then failing to protect those affected by policies based on those data, have been recognized at least from the time of the Holocaust to the 2020 U.S. Census and the current pandemic. Statisticians have the responsibility to engage in professional conduct in creating a statistical model; beyond that, what are their ethical and legal obligations as the gatherers and analyzers of the data?
It is easy, and perhaps comforting, to shift these concerns to those charged with implementing the policies based on our models, be they scientists, industry, or the government, but statisticians’ complicity cannot be denied.
Nor can it be said that having gathered and analyzed those data under guarantees of strict confidentiality and privacy absolves us. Recall that during World War II the U.S. Census Bureau breached the assurances it had given, supplying not only geographic data to assist in locating people of Japanese origin, but possibly information about individuals as well, to aid in implementing their transfer to internment camps.
Today, we have already seen the shunning of all things “Chinese” based on misinformation about the coronavirus; who might suffer if data from a model, machine-generated or otherwise, were used to institute harmful discrimination?
On the positive side, resistance by statisticians (and others) led to the omission of a proposed citizenship question from this year’s Census questionnaire; its inclusion would have compromised the constitutionally mandated purpose of the Census and harmed those who, deterred from responding, would not have been counted. Because the information serving the announced purpose of the question—to combat voter fraud—could be found elsewhere, the question could not be allowed to jeopardize the Census.
The release of otherwise-confidential data has often been justified by appeals to national security, but the detailed tabulation, by publicly available five-digit ZIP code, of those whose background was “Arabic” was explained as needed for airport signage before being dropped. What might those asked to participate in the project have discerned as a possible outcome?
Other negative examples of the use of data abound: population registries and special censuses enabled the forced migration of Native Americans; records were used to enforce “Jim Crow” laws in the post-Civil War South; data from the 1910 U.S. Census were used to identify and prosecute men who had not registered for the World War I draft; the identification of Jews and Roma in the Netherlands during World War II was facilitated by the thorough population data assembled earlier by the statisticians there; South African apartheid had data as its base; and population statistics compiled under the earlier colonial period in Rwanda contributed to the genocide there.
What might be the outcomes of algorithmic models developed by data scientists and others to identify characteristics of potential victims, locate real or potential hot spots of outbreaks, discover factors to mitigate the pandemic, or direct supplies where needed? Might an apparently benign virus registry of those with appropriate antibodies lead to conscripting them? Many have written about the problems that can arise if algorithms are used in sentencing, parole, education, employment, and elsewhere to determine the fate of individuals. If the data are there, how might an authoritarian or merely misguided regime use them?
We remember the fears that the Affordable Care Act would mandate algorithmically developed “death panels” to determine whether an individual had the characteristics that would merit treatment. Algorithms promoted as avoiding the discrimination found in other selection methods may introduce it themselves, in a particularly harmful way if their underlying basis is not known or cannot be discovered. The case of HIV/AIDS provides something of an analogy: testing there identifies those with symptoms so that they can be treated and their contacts protected, and statisticians were involved throughout the long battle to find ways to manage the syndrome. The data have been collected through a variety of mandatory or voluntary opt-in/opt-out procedures. Unfortunately, discrimination, and in some countries even criminal prosecution, has fallen upon those identified as positive and sometimes on those who refused the tests (Gruskin, Mills, and Tarantola 2007).
From the beginning of the identification of COVID-19, there has been a belief that “older” people are more vulnerable. (Disclosure: I am considered old.) This belief has been beneficial in that it has brought about protective actions (e.g., early hours at the local grocery store, isolation whether we asked for it or not and whether supported by the evidence or not) and lots of generous offers of help for us ancients, for which I am grateful.
Some grumbling or resentment has appeared, but few would endorse the suggestion of the lieutenant governor of Texas that old people should offer to die to protect the younger generations. Unfortunately, the emphasis on the vulnerability of older folks has fueled a sense of invulnerability in the young, as demonstrated by the trip of 70 University of Texas students to the beaches of Mexico, after which at least 60% of them were diagnosed with COVID-19. We know that in South Korea, New York, Iran, and New Orleans, religious gatherings have produced hot spots of the disease (if Mardi Gras can be considered religious), but COVID-19 is no more “religious”—any religion—than it is “Chinese.” Another widely circulated myth claims that the presence of 5G networks is what causes outbreaks.
There may be evidence of even scarier correlations: Will the general public understand causality, or the lack thereof? For example, what if redheads appear unusually susceptible to COVID-19? A small simulation, sketched below, shows how such a finding can arise from chance alone.
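As a minimal, entirely hypothetical sketch (in Python, with fabricated data; the “traits” and counts are invented for illustration): screen enough irrelevant characteristics against an outcome and roughly 5% will appear “significantly” associated purely by chance.

```python
import numpy as np

# Fabricated data: one row per random binary "trait" (e.g., red hair)
# and one random binary outcome (e.g., tested positive), generated
# independently, so no trait truly has any effect.
rng = np.random.default_rng(seed=0)
n_people, n_traits = 1_000, 100
traits = rng.integers(0, 2, size=(n_traits, n_people))
outcome = rng.integers(0, 2, size=n_people)

# Under the null of no association, |r| > ~1.96/sqrt(n) is "significant"
# at the 5% level, so about 5 of the 100 traits should clear the bar.
threshold = 1.96 / np.sqrt(n_people)
r = np.array([np.corrcoef(t, outcome)[0, 1] for t in traits])
n_hits = int(np.sum(np.abs(r) > threshold))
print(f"{n_hits} of {n_traits} random traits appear 'significantly' "
      "associated with the outcome by chance alone")
```

None of those traits causes anything; with enough variables under scrutiny, some scary-looking correlations are guaranteed, and a headline about any one of them would be correlation masquerading as causation.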
In collecting and analyzing data and constructing models for identification and treatment, participants—including statisticians, of course—must act in conformance with legal and ethical principles, such as privacy and confidentiality, and with standards of professional conduct; but they must also look beyond the work product and consider, and be accountable for, how their work is implemented. An emergency engenders legitimate national security and public safety concerns, but statisticians should not be complicit in trampling on personal security and fundamental rights.
Further Reading
Gruskin, S., Mills, E.J., and Tarantola, D. 2007. History, principles, and practice of health and human rights. The Lancet 370(9585), 449–455.
Seltzer, W., and Anderson, M. 2008. Using population data systems to target vulnerable population subgroups and individuals: Issues and incidents. In Statistical Methods for Human Rights, eds. J. Asher, D. Banks, and F.J. Scheuren. Springer, 273–328.
About the Author
Mary Gray, who writes The Odds of Justice column, is Distinguished Professor of Mathematics and Statistics at American University in Washington, DC. Her PhD is from the University of Kansas and her JD is from the Washington College of Law at American University. A recipient of the Elizabeth Scott Award of the Committee of Presidents of Statistical Societies and the Karl Peace Award of the American Statistical Association, she currently teaches legal and ethical issues in data science. Her research interests include statistics and the law, economic equity, survey sampling, human rights, education, and the history of mathematics.