Editor’s Letter – Vol. 26, No. 3
Dear Readers,
The cover article in this issue of CHANCE is an intriguing report by Stewart et al. on the use of Bayesian networks (BNs) for prioritizing the conservation of a group of endangered species. The article also reminds us of the role modern statistical learning approaches can play in optimizing decision-making based on a large network of input variables. The authors consider a host of factors, including expert biological knowledge, the costs of investing in conservation, the political dynamics associated with prioritization, the phylogenetic distinctiveness of the species, governmental and non-governmental support, and, most importantly, the urgency and timing of threats to the species’ survival. Separate BNs are considered with the common objective of quantifying expert opinions; support for conservation from political, communal, and governmental sources; and the complexity of the environment, ecology, and geography surrounding the species. The elaborate nature of the BNs discussed in the article allows for delicate inferences such as “high-value and high-cost species were 60% more likely to receive investment (probability = 0.8) when the urgency is high (extinction within three years) than when urgency is low (decades) (probability = 0.2).” In my view, the kind of large-scale modeling machinery presented in this work should at the very least motivate discourse among statisticians on the ways they can positively participate in wide-ranging policymaking.
In other pages of this issue, Marcello Pagano and Sarah Anoke discuss the potential hazards of using regression when there is reason to believe that one or more of the variables involved were measured with less precision than the others. This problem may be formalized in the context of correcting for attenuation of correlation due to measurement error, but, more plausibly, it can be a consequence of data manipulations originally employed to establish a sort of “fairness” in the representation of data. The authors give an overview of the classical problem of “probable errors” in conjunction with the seminal work of Pearson and Lee, which dates back to the early days of modern statistics at the dawn of the 20th century. They eventually construct a framework for tackling the problem via the class of errors-in-variables models.
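For readers who would like to see the attenuation effect for themselves, here is a minimal simulation sketch. It is not taken from the Pagano and Anoke article; it assumes the textbook classical measurement-error setup (independent Gaussian noise added to the predictor), and every name and parameter value in it (`ols_slope`, `simulate`, the slope of 2.0, the noise levels) is hypothetical and chosen only for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_slope=2.0, sd_x=1.0, sd_error=1.0, seed=0):
    """Compare the slope fitted on the true predictor with the slope fitted
    on a noisily measured copy of it. All settings are illustrative assumptions."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    # Response depends on the true predictor plus a small equation error.
    y = [true_slope * x + rng.gauss(0.0, 0.5) for x in x_true]
    # Observed predictor = true predictor + independent measurement error.
    x_observed = [x + rng.gauss(0.0, sd_error) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_observed, y)


clean_slope, naive_slope = simulate()
# Reliability ratio under the classical model: var(x) / (var(x) + var(error)).
reliability = 1.0 ** 2 / (1.0 ** 2 + 1.0 ** 2)
print(f"slope using the true predictor:     {clean_slope:.3f}")
print(f"slope using the measured predictor: {naive_slope:.3f}")
print(f"true slope times reliability ratio: {2.0 * reliability:.3f}")
```

Under these assumptions the naive slope is pulled toward zero by roughly the reliability ratio, which is the bias that errors-in-variables models are designed to account for.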
Elsewhere in the magazine, Paul Rosenbaum offers an engaging justification for applying differential comparisons in observational studies. Differential comparisons are simple in nature and easy to apply. They can serve as an efficient remedy in scenarios where unmeasured but non-ignorable biases can distort the estimation of the parameters of a statistical model. Using a case study comprising an equal number of smokers and nonsmokers, the author demonstrates the practicality of differential comparisons by contrasting daily smokers who have never used hard drugs with nonsmokers who have engaged in that risky behavior. Rosenbaum shows us that the dangerous effects of lead and cadmium in tobacco are detected more vividly when these two subclasses of the study are compared.
In his column, A Statistician Reads the Sports Pages, Shane Jensen presents a Bayesian modeling framework for measuring the impact of individual players in the National Hockey League (NHL). Pointing to the flaws of a frequently used regression model with a binary response, in which each player on the ice is assigned a +1 when their team scores a goal and a -1 when it concedes one, Jensen offers two alternative approaches: a regularized regression model in the vein of lasso regression, and a different treatment of the problem in which goals scored and goals conceded are viewed as two competing processes that can be analyzed with a Cox proportional hazards model.
In The Big Picture, column editor Nicole Lazar tells the story of her stimulating experience sitting in on a course devoted to the analysis of symbolic data. Symbolic data are data aggregated from a large number of individuals. Such unfamiliar territory demands a fresh set of strategies for exploratory and inferential data analysis. As discussed by Lazar, one possible approach is to represent the data as interval-valued objects and then study their centers and endpoints. To visualize symbolic data, then, one needs an alternative set of tools such as interval-based scatterplots.
Andrew Gelman’s ethics column is built around two complex stories: a paper in a highly prestigious social science journal suggesting that “parental investments create a disincentive for student achievement,” and a separate case of authors who refused to acknowledge a label-coding error, finding it sufficient to attach a “correction notice” to their work. Gelman argues that the problem might have to do partly with the increasing pressure on researchers to produce scholarly work with big-bang results, and partly with the decline of a scientific environment that should otherwise greatly value a rigorous self-correction process.
—Sam Behseta