Statistics: From College to Pre-college


Statistics education at the pre-college and college levels is interrelated in many ways. This article focuses on how current and future changes in college introductory courses may affect what happens in pre-college courses. Along the way, we will examine other relationships.

The principal college courses of interest are introductory courses taught in a statistics or mathematics department. In high schools, the main course of interest is the Advanced Placement (AP) Statistics course, through which high-school students may earn credit for a college statistics course. Some high schools also offer non-AP statistics courses. Finally, the Common Core State Standards (CCSS), adopted by most states, call for a considerable amount of statistics work to be included in the high-school education of all students.

As just one example of the type of issue under discussion, AP Statistics is intentionally designed to mimic the college courses for which students are most likely to get credit. There is some concern in the AP community that the AP syllabus has not changed since the course began nearly 20 years ago. Does this imply there has been no change in the college courses?

The College Board says (and many high-school teachers agree) that there has been little change in college statistics courses over the past 20 years. They base this conclusion on periodic surveys of college courses by the College Board. Meanwhile, some in the college statistics education community say their courses have changed dramatically in the same time period. How can we explain this difference in outlook? I think it stems from two main sources.

First, the College Board and the people active in college-level statistics education are sampling different populations. The College Board is primarily interested in college courses for which a good AP score provides credit. These are often courses in the mathematics department taught by nonstatisticians. Such courses are usually the last to change. The college statisticians who are active in statistics education are looking at their own courses. These are the first courses to change. From the perspective of the statisticians teaching such innovative courses, both the AP course and the college courses taught by nonstatisticians are behind the times.

A second reason for differing views about the amount of change in the colleges stems from an increase in variability in what is happening in college courses. When AP Statistics began 20 years ago, the textbooks by David Moore were widely adopted in the courses the College Board surveys; they served as the model for AP Statistics and were close to what the college statistics education community was teaching. Since then, the college statistics education community has diverged from both the AP course and the course typically found in mathematics departments.

This divergence of approach creates both problems and opportunities for AP Statistics. The course does, indeed, have to stay close enough to the “typical” course for AP students to gain credit for it. But the AP program will suffer if staying that close leaves the course considered seriously behind the times by the collegiate statistics education community.

The College Board will be considering revisions to the course in the next year or so. What I would like to see the College Board do then is to consider not only where the colleges are in teaching statistics, but where they are going. We could then seek ways in which the AP course could be improved without becoming so innovative that it no longer matches that “typical” college course. To do that, we have to look at the kinds of changes taking place in the courses offered by the leaders in college statistics education.

What’s New?

The current changes in college introductory courses reflect changes in the discipline of statistics. Two major changes in the discipline have recently been reshaping college introductory courses, although those changes are far from universal today.

The first change to occur was an increased use of resampling methods (see sidebars for examples). These are now thoroughly implemented in the commercial college introductory text by the Lock family, a free online text by David Diez, et al., a high-school text by Tabor and Franklin on statistics in sports, and others. These methods were first incorporated into an introductory textbook by Julian Simon in 1969, and have been part of at least some college courses ever since.

Resampling is an alternative method of statistical inference that relies on computer simulations. It works in many of the places traditional methods work, but also in many places traditional methods do not work well. In the last 10 years, resampling has become popular in introductory courses because many people find it much more understandable to beginners than the more-mathematical traditional approach.

A more recent change in college statistics courses has been the result of a desire to make the first course more relevant to the demands of Big Data. The traditional introductory course (including AP) is based on the assumption data are scarce, and the main problem is to avoid mistaking small, chance effects for real effects. That is not as serious an issue when data are plentiful, so much of the traditional introductory course appears irrelevant in that situation. Indeed, statisticians regularly complain that workers in Big Data fail to see the relevance of statistics. Small wonder if that relevance is invisible in the first course!

Resampling in the High Schools

Resampling has been creeping into the high schools in recent years. Although no resampling methods appear in the current AP Statistics syllabus, some teachers discuss them for pedagogical reasons, and two examples can be found in the latest edition of one AP text. Some high schools use the Tabor/Franklin resampling-based text on statistics in sports for a non-AP statistics course. Surprisingly, similar methods also now appear in the more-elementary CCSS:

“Make inferences and justify conclusions from sample surveys, experiments, and observational studies.

“CCSS.Math.Content.HSS-IC.B.3 Recognize the purposes of and differences among sample surveys, experiments, and observational studies; explain how randomization relates to each.

“CCSS.Math.Content.HSS-IC.B.4 Use data from a sample survey to estimate a population mean or proportion; develop a margin of error through the use of simulation models for random sampling.

“CCSS.Math.Content.HSS-IC.B.5 Use data from a randomized experiment to compare two treatments; use simulations to decide if differences between parameters are significant.”

It is worth noting that confidence intervals are not mentioned and hypothesis tests are mentioned only for experiments. There is no mention of t-tests and the like; instead, the word “simulation” appears. The guidelines might equally have said “bootstrap” or “permutation test,” but those terms would not be familiar to the intended audience. For surveys, many statisticians would see the bootstrap as the obvious “simulation” for standard IC.B.4.

In the Common Core, inference for surveys stops short with a margin of error. That is consistent with the fact that the bootstrap can describe sampling variability, but its confidence intervals often are not at the claimed level—certainly not for the simple bootstrap and small samples usually used in high-school classes.

Some textbooks on survey sampling (e.g., Scheaffer, et al.) ignore the distinction between the normal and Student t distributions (as well as degrees of freedom) and simply double the standard error to get the margin of error. The bootstrap standard deviation would provide a reasonable simulation-based proxy as an alternative to using the usual standard error formula (which is simple to compute, but hard to justify to beginners). Doubling this gives a simulation-based “margin of error” that describes sampling variability. We simply have to avoid attaching a specific confidence level, such as 95%, when samples are small. Many people find this a more intuitive introduction to sampling variability than the traditional approach through theory. It can even be used in a course where the mathematical approach is also taught—a policy that would move AP forward without making it less attractive as a substitute for traditional courses.
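To make that calculation concrete, here is a minimal sketch in Python. The survey responses and the number of resamples are made up for illustration, and the code is one possible implementation rather than anything prescribed by the Common Core or any particular text: it doubles the standard deviation of the bootstrap distribution of the sample proportion to get a simulation-based margin of error.

```python
import random

# Hypothetical classroom survey: 1 = "yes", 0 = "no".
sample = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

def bootstrap_margin_of_error(data, resamples=5000):
    """Double the bootstrap standard deviation of the sample proportion."""
    n = len(data)
    props = []
    for _ in range(resamples):
        resample = [random.choice(data) for _ in range(n)]  # same size, with replacement
        props.append(sum(resample) / n)
    mean_prop = sum(props) / resamples
    sd = (sum((p - mean_prop) ** 2 for p in props) / (resamples - 1)) ** 0.5
    return 2 * sd  # simulation-based "margin of error"

print("Sample proportion:", sum(sample) / len(sample))
print("Margin of error (2 x bootstrap SD):", round(bootstrap_margin_of_error(sample), 3))
```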

The obvious simulation for significance testing for experiments is a permutation test using a sample from all possible rerandomizations. The fact that nearly all AP Statistics and college introductory statistics textbooks develop inference methods based on a sampling model, and then apply them with little justification to inference for experiments, leaves a serious conceptual gap in the course. A permutation test is the gold standard for inference from experiments and fills a conceptual hole in most first courses. Such tests would certainly fulfill the Common Core’s demand for significance testing on experimental data using a “simulation” (see sidebar).

We are only now seeing signs of this in AP texts. The latest edition of one explains that the sampling-based methods are just large-sample approximations when applied to data from experiments, while the latest edition of another includes two examples of permutation tests to explain the basis of inference for experiments. Again, this is an update to AP that does not conflict with its covering the topics in a traditional course—in fact, it enhances the coverage of those topics.

It certainly seems possible that the Common Core will result in teaching resampling methods to more students than will ever take AP Statistics or a college statistics course. We will have to see whether everyone (or anyone) sees the bootstrap or permutation tests as relevant simulations for the Common Core, whether the Common Core gets implemented in fact rather than in name only, and whether it gets thrown out for political reasons.

There is also the unfortunate fact that very few high-school teachers ever saw resampling methods in the statistics courses they took (if they took any) in college. In any case, to the extent that the above-quoted sections of the Common Core get implemented, those topics will find their way into teacher workshops. AP Statistics teachers are already becoming consultants to other K–12 mathematics teachers, and thus have to learn about such simulations. Having done so, they may introduce these into their AP courses, as some have already done.

Permutation Tests

Permutation tests go back at least to R. A. Fisher in 1936. They apply most directly to hypothesis testing for experiments. Imagine a simple experiment in which 10 randomly selected subjects receive a treatment and 10 others do not. The null hypothesis amounts to saying that any difference we might see in outcomes is attributable only to the randomization process. Provisionally assuming this null, we generate the distribution of differences that would arise from reassigning the given data to the two groups at random.

As with methods based on distributional assumptions and tables, we reject the null if the actual outcome is in the tails of the distribution. Fisher rarely actually used this method. Even for our simple example, there are 184,756 possible reassignments of the 20 values to consider. Fisher regarded the usual methods based on tables, such as those for the Student t or normal distributions, as convenient approximations that avoided the massive computational effort involved in permutation tests.

Today computers can do those computations for AP-sized batches of numbers. For larger data sets, we can take a large sample of the possible rearrangements. In 2004, Ernst suggested using permutation tests for the college introductory course; these are the methods of choice for inference in Tabor and Franklin’s high-school text on statistics in sports.
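Here is a minimal sketch of such a permutation test in Python, taking a random sample of the possible rearrangements rather than enumerating all 184,756. The outcome values for the two groups are hypothetical, chosen only to illustrate the mechanics.

```python
import random

# Hypothetical outcomes for the 10 treatment and 10 control subjects.
treatment = [23, 31, 28, 35, 27, 30, 33, 26, 29, 34]
control   = [22, 25, 24, 27, 21, 26, 28, 23, 20, 25]

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

pooled = treatment + control
count_extreme = 0
reps = 10000  # a random sample of the possible reassignments

for _ in range(reps):
    random.shuffle(pooled)                 # re-randomize the group labels
    new_t, new_c = pooled[:10], pooled[10:]
    diff = sum(new_t) / 10 - sum(new_c) / 10
    if abs(diff) >= abs(observed):         # two-sided comparison with the actual outcome
        count_extreme += 1

print("Observed difference:", observed)
print("Approximate p-value:", count_extreme / reps)
```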

Statisticians familiar with the history of their subject generally consider permutation tests to be the gold standard for data from experiments using random assignment. Users of statistics, however, have often never heard of permutation tests, and may even question whether they are as accurate or valid as the more-common approximations. There is also the question of how relevant such tests are to the analysis of data gathered from surveys rather than experiments.

Big Data in the Colleges

Big Data refers to the analysis of huge observational data sets. Opinion varies on how big is “big.” For purposes of a first course, the relevant cut-off is low: It is at the point where we have enough data that it makes sense to use part of the data to develop a model and another part to test it. At that point, most of the traditional inference procedures—and the long build-up about sampling distributions and the Central Limit Theorem—become much less relevant.

Perhaps as important as size is the fact that most Big Data are observational. AP Statistics gives top billing to surveys, with some attention to experiments, but little attention to observational studies. Finally, with Big Data, we usually are seeking a model from the data. In AP and most introductory courses, we are usually testing models given by the textbook. Thus, Big Data requires a very different skill set.

On the other hand, Big Data skills such as data exploration, modeling, and working with observational data are useful to people not working with Big Data, which gives these skills some claim to a place in a first course. Once again, this need not conflict with AP substituting for a traditional course.

These issues have had little impact on the high schools so far, but may have an effect on college courses as techniques for Big Data become more common. Efforts to include such techniques in a first college course are currently at the individual instructor level, but there are reasons to think this might create change much more rapidly than resampling methods did. For one thing, there is little doubt among statisticians of the validity or importance of Big Data techniques. In addition, there are strong self-interest motivations here as statisticians see computer scientists become the leaders in the field of Big Data. This issue is raised frequently in editorials in publications aimed at statisticians, yet the first course does little to appeal to those interested in Big Data.

In fact, traditional textbooks (including AP texts) usually present raw data for examples and exercises with few enough observations that the computations might be done with a pocket calculator. (Presumably this is because publishers do not want to lose sales to any course using calculators as the technology of choice.) Usually, students are asked only to carry out the technique explained in the section they just read, not to explore the data using all the techniques they have learned so far. Exploring possible relationships between variables—usually the main point of interest in real research—is mostly limited to making a scatter plot or regression analysis for the only two variables provided.

These are long-standing criticisms of the first course, but Big Data may provide additional incentives to address them. If we can show students how to explore an HTML table of data they find on the Internet and that interests them, we can vastly increase the amount of data available for students to explore. Using the student’s data set rather than the teacher’s or the textbook’s challenges the student to formulate hypotheses and explore them. Along the way, the student learns the important skill of matching the tool to the job (rather than using the tool the textbook asked for).
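As one possible way to do this (an illustration under stated assumptions, not a recommendation of any particular tool), the pandas library for Python can read every HTML table on a web page into a data frame ready for exploration. The URL and column names below are placeholders.

```python
import pandas as pd

# Placeholder URL: substitute any web page containing an HTML data table.
url = "https://example.com/some_table.html"

tables = pd.read_html(url)   # one DataFrame per <table> element on the page
data = tables[0]             # pick the table of interest

print(data.head())           # a first look at the rows
print(data.describe())       # quick numerical summaries
# With matplotlib installed, two numeric columns can be explored graphically:
# data.plot.scatter(x="column_a", y="column_b")
```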

Beyond human inertia, the biggest impediment to using both resampling and Big Data in high-school courses is technology. AP Statistics students normally use graphing calculators that automate some of the work done with tiny data sets, but are not designed for doing enough repetitions to generate a plausible permutation test or bootstrap estimate, or even to enter Big Data.

There has been one small step toward computer use since the AP course began: Schools now have to submit detailed course descriptions to the College Board before they may legally label a course “AP” on a transcript. As part of this process, schools must describe how computers will be integrated into the course. At present, though, such use seems to be mainly demonstrations by the teacher (or an applet), rather than lots of hands-on data analysis by the student. While many schools are short of money for computers, and of space to put them, statistical software is not very demanding. Much of it will run on machines so old that people are paying recycling centers to dispose of them.

Still, change is not impossible. The textbooks by David Moore revolutionized college introductory courses in the 1980s. After a slow start, resampling methods are appearing in more and more college and high school courses. In 2007, George W. Cobb, one of the most influential statistics educators of our time, suggested that resampling methods replace traditional methods in the first course. And while response to Big Data in that first course is just getting off the ground, that change may have much more energy behind it as statisticians see more and more of their students conclude that statistics is irrelevant to Big Data. At the same time, we have to recognize that some of the needs of Big Data differ from the needs of students seeking research skills in traditional fields of application, or a more informed citizenship. We need to look for ways in which attending to Big Data can improve the first course for everyone.

What’s Next?

One way to compare college and high-school statistics, both now and for the future, is to look at how key steps in a statistical study are handled. The first step is to design the study. The average college course largely ignores this step, with the expected result that students do not learn to design their studies. AP Statistics has always emphasized design, and been comparable here to the best college courses.

That emphasis has increased with the years, and we now see more emphasis on experiments compared to surveys. Observational studies are still relatively neglected, which is doubly unfortunate, since nearly all Big Data projects are observational. The Common Core leads the pack by giving equal weight to the three types of studies. By discussing all three, we can more clearly show the strengths and limitations of each. For example, we can talk about confounding of variables, how a well-designed experiment can control it, and how lack of control in observational studies and Big Data limits the conclusions we can draw. This illustrates another way in which Big Data can inform a first course without turning it into a data-mining course. We can talk about these limitations of inference without going into a lot of detail, and in a way that will improve the course for everyone.

We can also involve students more actively by asking them to consider various possible designs for a study not yet done, rather than just trying to determine the design of a study from a verbal description written after the study has been done.

For Big Data projects, the next step involves things like accessing the data from a database, possibly sampling from it or partitioning it, cleaning it, and the like. Here, a reasonable compromise might be for students in a first course to see examples of such issues, but not spend a lot of time on them. These problems are relatively mild in traditional surveys and experiments, where the researcher makes all the decisions about data entry (but not so mild that we can ignore them as we do now). Hence, they may not be of prime importance to students headed in traditional directions. Yet, even these students should be aware that they need to give careful thought to data entry and coding if they want to avoid analysis becoming a nightmare.

The Bootstrap

A resampling method based on random sampling is the bootstrap. Here, the idea is that we have a random sample and wish to use it to estimate how accurately it represents the population. In many cases, we can—just as with traditional methods—use a sample statistic to estimate its corresponding population parameter. Again, as with traditional methods, it is not so much the estimate that concerns us as the additional estimate of the accuracy of the initial estimate.

Traditionally, we use the standard deviation of the sampling distribution (i.e., the standard error of our sample statistic) as a typical value for sampling error. The bootstrap estimates the shape and variability of this sampling distribution by creating a distribution formed by taking resamples from the existing sample. These resamples are of the same size as the original, but are taken with replacement. We then take the standard deviation of the bootstrap distribution as an estimate of standard error, or at least as a statistic measuring sampling variability. There is little objection to the latter, but there are questions about how well the standard deviation of the bootstrap distribution estimates the standard deviation of the sampling distribution. For small samples, it underestimates this by quite a bit.
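A minimal sketch of that calculation in Python follows, using a made-up small sample of the kind a class might collect. It places the standard deviation of the bootstrap distribution of the mean next to the usual formula standard error so the two can be compared side by side.

```python
import random
import statistics

# Hypothetical small sample (the kind a high-school class might collect).
sample = [4.1, 5.6, 3.8, 7.2, 5.0, 6.3, 4.7, 5.9]
n = len(sample)

boot_means = []
for _ in range(10000):
    resample = [random.choice(sample) for _ in range(n)]   # same size, with replacement
    boot_means.append(statistics.mean(resample))

boot_sd = statistics.stdev(boot_means)          # bootstrap measure of sampling variability
formula_se = statistics.stdev(sample) / n**0.5  # traditional standard error of the mean

print("Bootstrap SD of resampled means:", round(boot_sd, 3))
print("Formula standard error s/sqrt(n):", round(formula_se, 3))
```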

Over the years, the bootstrap has been much more controversial than permutation tests. Statisticians generally attribute its invention to Bradley Efron, who used it in certain specialized circumstances and studied its properties. The most enthusiastic proponents of the bootstrap are often influenced by Julian Simon, who advocated its use for most of the inference procedures in introductory statistics, wrote a textbook about the method, and developed software to carry out the computations. Proponents viewed that approach as valid, but many statisticians regarded the claims as unproven.

In particular, it was claimed that the bootstrap was not subject to the limiting assumptions of traditional methods, which often work poorly for samples that are small or come from a skewed population. But, in fact, one crucial assumption was made: 95% confidence intervals were constructed between the 2.5th and 97.5th percentiles of the bootstrap distribution, rather than from a normal or Student t distribution, on the assumption that 95% of those intervals would include the true population value. Later simulation studies indicated that this is not always true. It was believed that the bootstrap would be better than traditional methods for small samples, but, in fact, it performs poorly for small samples. Hence, there are many tweaked versions of the bootstrap with better performance. These are too complex to appear in an introductory course as anything other than a black box, so they lack any of the pedagogical advantages of resampling.

This brings us to further consideration of the audiences for a first course. The introductory statistics course has long been schizophrenic about its intended audience. Probably the majority of those who take the course will never do a real study, and need only know things that will help them to evaluate research done by others. Probably even fewer will ever work with Big Data. Looking at the stages of a study as described above, a future researcher would need some of the data processing skills needed for Big Data, while an evaluator of research would need none. All student audiences need to know what constitutes good design, and the limitations on the conclusions we can draw from a poor design—or no design at all.

The next step in a study would be to look at the data. For a designed study, we hope not to see any surprises, and generally we are looking for typos, outliers, or distribution shapes markedly different from expectations. For Big Data, exploration may be the principal task. For the consumer of research, exploration of the raw data is rarely an option. In AP, students are expected to check assumptions for all inferences, and this usually requires some sort of graphical display. This is mandatory for researchers, and useful for consumers in that it communicates what needs to be done, so readers can notice whether a report gives any indication that assumptions have been checked.

The Common Core does not include any inference procedures with assumptions that need checking against the data, though it does spend lots of time looking at data. The typical college course spends little or no time on checking assumptions against the data. Both this college course and the future users of Big Data would benefit from more emphasis on exploration. This, in turn, requires data sets large enough that we can find interesting patterns and make meaningful checks of assumptions.

A neglected area of interest to those doing research is where the hypotheses we test come from. Usually they come from prior experience in exploring data in the researcher’s field. In addition, such prior experience is a much better basis than a small sample for believing that data of the sort we have gathered follow some theoretical distribution.

Finally, exploring data and formulating their own hypotheses is very motivating for students. Taken together, these issues suggest that more data exploration might be advantageous for all audiences in the college and AP courses. Possible resistance might come from the departments serviced by such a course, which often wish to cram in as many different hypothesis tests as possible—at the expense of anything else.

Let us turn next to what many serviced departments consider the only step worth doing—testing hypotheses and computing confidence intervals. An alert reader may have noted that the discussion above of study design has ignored resampling methods. While those have pedagogical uses here and there in the previous steps, they are primarily of interest as alternate inference techniques. However, Big Data offers a much simpler path to inference. It is common to partition a data set in two, using one part to generate hypotheses and the other part to confirm them.

This is a very simple approach to illustrating the existence of chance effects and providing a means of dealing with such effects. While this approach is not useful for small data sets, the underlying idea of statistical inference is vital to all audiences.
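A minimal sketch of such a partition, assuming the data are simply a list of records, is shown below. The function name and the 50/50 split are illustrative only; the point is that hypotheses suggested by one half are then checked only against the other half.

```python
import random

def split_in_two(rows, explore_fraction=0.5, seed=1):
    """Randomly partition rows into an exploration part and a confirmation part."""
    shuffled = rows[:]                       # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split reproducible
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

# explore, confirm = split_in_two(rows)
# Patterns found while browsing `explore` are treated as hypotheses and tested
# only on `confirm`, so chance effects do not get "confirmed" by the same data.
```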

Can We Do It?

The Common Core State Standards for statistics are already quite ambitious. It is unlikely that they will become even more ambitious any time soon. AP Statistics is likely to change due to pressure from two directions. First, more and more high school teachers are seeing and using resampling methods, and more students may be seeing those methods due to the Common Core. (They are not, however, seeing any effects of Big Data.) Secondly, as changes take place in college courses, AP courses should reflect those changes.

A limiting factor on that second point is that the college course students are most likely to get credit for with a good AP Statistics score is a general education course taught in the mathematics department, usually not by a statistician. Many of those courses are still far behind the current AP syllabus, and may be the last to be affected by change.

Unfortunately, these are the statistics courses most likely to be taken by future high-school teachers. As long as this situation persists, there will be a need for massive teacher retraining efforts if resampling and even medium-sized data are to reach the high schools. Those efforts must both retrain current teachers and retrain those who will train future teachers. One possible model for workshops might have a statistician presenting the new material in the morning, then joining the teachers as a team member in the afternoon to develop new curricular material. The knowledge high-school teachers have of what is going on inside students’ heads can be eye-opening for many college faculty.

The second major obstacle is the near-total absence of regular use of computers by students in high-school statistics. Such resources are needed if we hope to do resampling, or even hint at Big Data. Technology in AP Statistics is already decades behind the times, and increased use of technology at the college level can only widen the gap. Perhaps the tipping point will come when colleges begin refusing credit for AP Statistics on grounds of obsolescence.

Conclusion

At the beginning of this article, I mentioned that some people are frustrated by the lack of change in the AP syllabus over the decades. Defenders point to limited change in the college courses AP targets. I think part of the frustration is that, at its inception, AP Statistics represented current best practices, while today it represents current average practices. The Common Core and changes in the college course may yet convince AP to lift its aim back to its original level. Attempting this, and making it work, will require lots of effort from the entire statistical community.

Further Reading

Official AP Statistics course description (PDF download).

Common Core Standards for statistics and probability.

Cobb, George W. 2007. The introductory statistics course: A Ptolemaic curriculum. Technology Innovations in Statistics Education, 1(1).

Diez, David. 95% confidence levels under variety of methods.

Hesterberg, Tim. 2014. What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum (PDF download).

Matloff, Norman. 2014. Statistics Losing Ground to Computer Science, Amstat News, 449:25–26.

About the Author

After a few years in industry, Robert W. Hayden (bob@statland.org) taught mathematics at colleges and universities for 32 years and statistics for 20 years. In 2005, he retired from full-time classroom work. He now teaches statistics online at statistics.com and presents summer workshops for high-school teachers of AP Statistics.
