Editor’s Letter – Vol. 25, No. 2
Data analytics is a new catchphrase in statistics and computer science, one that has recently appeared in mass media outlets such as The New York Times. Scotland Leman and Leanna House, authors of this issue’s lead article, define the term as an approach to data analysis that uses not only statistical methods but also a number of instrumental techniques from data mining, computer science, and machine learning. The learning component of analytics can serve as a teaching engine for communicating novel approaches in data analysis in a robust and interactive fashion. For example, the authors demonstrate how their discovery-via-visualization approach can greatly improve students’ appreciation of the subtleties associated with dimensionality reduction in the analysis of Fisher’s Iris data. Moreover, the authors provide an alternative to the traditional teaching paradigm of presenting the underlying formulas and then giving a numerical example. The article should thus appeal to readers interested in statistical pedagogy.
In “The Family Tree of an Epidemic,” Adam Kucharski introduces another rapidly growing research area: modeling the evolutionary phases of an epidemic by applying dynamical models, time series analysis, and social networks, which are commonly used in phylogenetics. Kucharski’s main proposal is that, in addition to analyzing the time series of incidence, modern epidemiology can benefit hugely from genetic data.
In a unique exercise in documenting what goes into earning a PhD in statistics, Alden L. Gross combines data visualization, humor, and sharp social commentary to address a series of philosophical and pragmatic concerns along the lines of “Why does graduate school take so long?” and “Are you missing out on the rest of life?”
Maya Bar-Hillel and Ro’i Zultan provide a remarkable demonstration of the so-called middle bias in decision making through an analysis of bets placed at roulette. A carefully composed contour map of all the bets made within a fixed period of time vividly exhibits the middle bias when the position of the bettor around the roulette table is taken into account.
Laura Taylor builds a competing-risks model with recurrent events using play-by-play data collected from the 2011 NCAA Men’s Basketball Tournament. Taylor’s approach highlights the adaptability of the competing-risks method for analyzing sports data through six event types: points scored and points allowed from free throws, two-point shots, and three-point shots.
In her column, The Big Picture, Nicole Lazar writes about multiple testing and multiple comparisons in the analysis of large data. Lazar gives the reader a tour of influential approaches, such as the Benjamini-Hochberg method of controlling the false discovery rate, and hints at the effectiveness of Bayesian modeling for incorporating various types of dependence structure in multiple testing.
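For readers who have not met the Benjamini-Hochberg procedure, the following is a minimal sketch of its standard step-up form, not code taken from Lazar’s column. The function name, the target rate q, and the p-values are all made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected by the BH step-up rule.

    Illustrative sketch: sort the m p-values, find the largest rank k such
    that p_(k) <= k * q / m, and reject the k hypotheses with the smallest
    p-values.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])


# Hypothetical p-values, chosen only to illustrate the rule.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p_values, q=0.05)
print(rejected)
```

With these made-up numbers, only the two smallest p-values fall under their thresholds (0.05/8 and 2 × 0.05/8), so only the first two hypotheses are rejected.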
Mine Çetinkaya-Rundel, Dalene Stangl, and Kari Lock Morgan create a lesson plan using Google’s Transparency Report, published in October 2011. The report listed the frequency of requests, by country of origin, to remove specific content from the Internet. The authors show that a data set including the number of users and accounts for whom data was requested, population estimates for each country, and indexes of human development, freedom of the press, and democracy can be used effectively for teaching exploratory data analysis and statistical inference.
In A Statistician Reads the Sports Pages, Shane Jensen revisits Frederick Mosteller’s 1952 Journal of the American Statistical Association article, “The World Series Competition.” Like Mosteller’s classic, Jensen concludes that the better team (defined as the team with the superior performance in the regular season) will not have a clear advantage in the World Series. Jensen supports this claim by examining the time series of the proportion of series won by the better team and by accounting for the number of upsets in the history of the series.
Howard Wainer discusses a serious issue in modeling educational data: imputing nonexistent data with zeros. Wainer shows, quite effectively, how and why even simple imputation techniques (say, a regression model) can drastically improve the quality of school rankings. He illustrates his argument with a case study of ranking the performance of the Promise Academy in Memphis, Tennessee.
In Ethics and Statistics, Andrew Gelman tells two disturbing stories of ethical violations in medical trials. The first is about testing drugs on undocumented immigrants; the second deals with the political, social, ethical, and indeed statistical layers surrounding the breast cancer treatment drug Avastin.
Alessandra Iacobucci, guest reviewer of this installment of book reviews, highly recommends The Art of R Programming by Norman Matloff. Meanwhile, the book editor, Christian Robert, is on a roll, reviewing Understanding Computational Bayesian Statistics by William Bolstad; Bayesian Ideas and Data Analysis by Christensen, Johnson, Branscum, and Hanson; and Bayesian Modeling Using WinBUGS by Ioannis Ntzoufras.