P-values: What Do They Prove?


Statisticians face a difficult audience when asked to explain p-values, confidence intervals, standard deviations, and many more complex statistical models to judges and juries—especially when the lawyers involved may have chosen their careers with avoidance of mathematics in mind.

In one of a long series of jury discrimination cases, Swain v. Alabama in 1965, a 10% difference between the minority share of those eligible to serve and the minority share of those selected to serve on the relevant jury panels was deemed insufficient to trigger a retrial. In light of the numbers involved, the probability of such a result occurring by chance was about 1 in 10^8, an indication of possible discrimination not noted by the court. A decision two years later, Whitus v. Georgia, included a footnote recognizing that the difference in percentages might not be the best basis for a decision about the fairness of the jury selection process. It also described the probability of the result in the case as 6 in 10^6. However, the court noted that this information was not essential to the finding in favor of the defendant, as there were other factors contributing to the unfairness of Whitus’ conviction. [The case involved an African-American man who had been convicted by an all-white jury despite living in a community with a population that was 45% African-American.]
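A rough sense of how such chance probabilities arise can be had from a binomial tail calculation. The Python sketch below is illustrative only: the panel size, the count, and the pool share are assumed inputs, not figures taken from the text above, and the binomial model (independent draws from a large pool) is itself a simplifying assumption. The function name `binomial_lower_tail` is hypothetical.

```python
from math import comb

def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Illustrative inputs, assumed for this sketch (not taken from the text above):
# a panel of 90 with 7 minority members, drawn from a pool that is 27% minority.
panel_size = 90
minority_on_panel = 7
minority_share_of_pool = 0.27

p_chance = binomial_lower_tail(minority_on_panel, panel_size, minority_share_of_pool)
print(f"P(at most {minority_on_panel} of {panel_size} by chance) = {p_chance:.2e}")
```

With inputs of this size, the tail probability is on the order of a few in a million, which conveys far more than the raw gap in percentages does.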

In subsequent cases concerning such issues as jury selection discrimination, employment discrimination, and the effects of pharmaceuticals, the courts have come to recognize the importance of evidence showing that certain outcomes were unlikely to be due to chance. They have since grappled with whether two or three standard deviations from zero, or some other measure, ought to provide a bright line for a finding that something other than chance was at work.
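To make the "two or three standard deviations" yardstick concrete, the minimal Python sketch below converts a number of standard deviations into a two-sided tail probability. It assumes the test statistic is approximately normal, which is an assumption made here for illustration and not something the courts specify. The function name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided tail probability for a standard normal statistic z."""
    # P(|Z| >= z) = erfc(z / sqrt(2)) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

for z in (2, 3):
    print(f"{z} standard deviations -> p = {two_sided_p(z):.4f}")
# 2 standard deviations -> p = 0.0455
# 3 standard deviations -> p = 0.0027
```

Under that normality assumption, the choice between two and three standard deviations is a choice between roughly a 1-in-20 and a 1-in-370 threshold.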

The courts’ difficulty with p-values in part echoes the same phenomenon seen in statistics classes, namely that the p-value is thought of as the probability of the hypothesis rather than the probability of the evidence given the hypothesis. For example, at the trial of a seminal Sudden Infant Death Syndrome (SIDS) case in the UK, Regina v. Clark, a pediatrician was asked about the probability of two SIDS deaths occurring by chance in one family. Using a probability of 1 in 8,543 from a study about the prevalence of SIDS, he squared this value to obtain “approximately a chance of 1 in 73 million.” Not until the second appeal of Sally Clark’s conviction did the legal system recognize that this evidence may have led the jury to believe that this was the probability of Clark’s innocence. This is known as the “prosecutor’s fallacy.” It is found all too frequently, even dating back to the 19th-century Dreyfus case (although the statistical “evidence” in that case was hardly dispositive and was challenged by Henri Poincaré and others).
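The arithmetic behind the 73 million figure, and the reason it is not the probability of innocence, can be sketched in a few lines of Python. Squaring treats the two deaths as independent. The homicide probability below is a purely hypothetical value chosen for illustration, and treating SIDS and homicide as the only two explanations is a simplification; neither comes from the case record.

```python
sids_one_in = 8543
double_sids_one_in = sids_one_in ** 2  # 72,982,849, i.e. about 73 million
p_double_sids = 1 / double_sids_one_in

# HYPOTHETICAL value, chosen only for illustration: chance of a double
# homicide in such a family. It is not a figure from the case or the article.
p_double_murder = 1 / 100_000_000

# Simplification: treat SIDS and homicide as the only two explanations,
# then apply Bayes' rule to the two competing rare events.
p_innocent_given_deaths = p_double_sids / (p_double_sids + p_double_murder)

print(f"P(two SIDS deaths) = 1 in {double_sids_one_in:,}")
print(f"P(innocent | two deaths) = {p_innocent_given_deaths:.2f}")
```

Under these assumed inputs the probability of innocence comes out near 0.58, nowhere near 1 in 73 million. The point is that a rare event must be weighed against the rarity of the alternative explanation.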

In a surprise to many, including statisticians, the Supreme Court concluded in 2011, in Matrixx Initiatives v. Siracusano, that evidence could be important even if it was not “statistically significant” under some generally accepted standard. The setting was an unexpected one: securities fraud. Basically, the issue was whether investors were materially misled by Matrixx’s failure, in its public announcements, to disclose that there had been a considerable number of adverse reactions (loss of sense of smell) to the use of its cold remedy, thus presenting an unduly optimistic outlook to investors.

Although Bayesians may see the courts’ confusion in a variety of settings as an opportunity for their favored approach, it has received little support.

The recognition that there is nothing sacred about a p-value is welcomed by many, as is the attention given by the Supreme Court to the importance of measurement errors in Hall v. Florida, a case involving developmental disability. In Atkins v. Virginia in 2002, the Supreme Court ruled that executing those with mental retardation violates the Eighth Amendment. In Florida, a base criterion for deciding whether a defendant is developmentally disabled and thus cannot be sentenced to death was an IQ of 70 or less. In Hall v. Florida, the court—basing its decision on the measurement errors inherent in the determination of IQs and on the need to consider developmental factors other than IQ—concluded that Hall was entitled to consideration as possibly being developmentally disabled, even though his IQ of record was 71.
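The role of measurement error in a case like this can be illustrated with a simple band around the recorded score. The Python sketch below assumes a standard error of measurement of 2.5 IQ points and a normal 95% band; both are assumptions made for illustration, since the text does not state the values the court relied on. The function name `score_band` is hypothetical.

```python
# ASSUMPTION for illustration only: a standard error of measurement (SEM)
# of 2.5 IQ points. The article does not state a value.
SEM = 2.5
Z_95 = 1.96  # normal quantile for a two-sided 95% band

def score_band(observed: float, sem: float = SEM, z: float = Z_95):
    """Return the (low, high) band around an observed score."""
    margin = z * sem
    return observed - margin, observed + margin

cutoff = 70
low, high = score_band(71)
print(f"Observed 71 -> band {low:.1f} to {high:.1f}")
print("Cutoff inside band:", low <= cutoff <= high)
```

Under the assumed error, a recorded score of 71 is compatible with a true score at or below the cutoff of 70, which is why a fixed cutoff applied to a single score is hard to defend.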

Alongside disputes over jury selection and eligibility for the death penalty, there has been an extensive and generally inconclusive battle among statistics experts about the death penalty’s deterrent effect on crime. Numerous studies have shown that the race of a murder victim is strongly predictive of whether the death penalty is imposed, yet that so far hasn’t convinced the courts of the unfairness of the death penalty in a particular case.

The briefs in the landmark Brown v. Board of Education might be said to mark the beginning of evidence-based decisionmaking for the courts. At issue in that case was whether the massive amounts of data on the effects of “separate but equal” showed that, when it comes to racial segregation in public schools, separate can never be equal. But what statistics can or cannot show still hangs in the balance. The Supreme Court’s 2013 decision in Fisher v. Texas returned the issue of affirmative action to the lower courts to decide whether an admission plan designed to supplement a policy of admitting the top 10% of each Texas high school graduating class could be defended against charges of reverse discrimination. Rival statistical studies before the court had focused on whether having a single minority student in many classes was sufficient to provide the critical mass needed to overcome the adverse effects of isolation, and on whether students benefiting from affirmative action admissions became successful graduates. On remand, the Court of Appeals for the Fifth Circuit subjected the Texas admission plan to the “strict scrutiny” required by the Supreme Court decision and once again found it acceptable under the affirmative action standards of Grutter v. Bollinger [a landmark case that upheld the admissions policy at the University of Michigan’s law school].

Currently before the court is a case involving the redrawing of legislative districts, long a fertile ground for statistical analysis. In earlier cases, the question has usually been whether districts have been structured in a way that limits minority representation in each district, thus effectively suppressing the voting influence of minorities. In Alabama Legislative Black Caucus v. Alabama (decision expected 2015), the question is not whether there are too few minority voters in a district, but whether there are too many. That is, have so many minority voters been crammed into a single district that there is no possibility of electing more than one legislator whom minority voters might prefer?

Statistical modeling has found a place in all sorts of litigation, both benign and harmful. For example, while the use of multiple factors in deciding degree of guilt in individual cases may represent good evidence-based decisionmaking, a recent tendency to base sentencing on factors predictive of recidivism is less so. Demographic factors such as socioeconomic background, race, and neighborhood are now sometimes being used to decide the length and nature of the penalty. But punishment should be for what one has done, not for who one is. Or what is justice?

Mary Gray
Mary Gray is professor of mathematics and statistics at American University in Washington, DC. Her PhD is from the University of Kansas, and her JD is from Washington College of Law at American. A recipient of the Elizabeth Scott Award from the Committee of Presidents of Statistical Societies, she is currently chair of the American Statistical Association Scientific and Public Affairs Advisory Committee. Her research interests include statistics and the law, economic equity, survey sampling, human rights, education, and the history of mathematics.
