Musing About Changes in the SAT: Is the College Board Getting Rid of the Bulldog?
During the first week of March 2014, the front pages of most U.S. newspapers carried a story released by the College Board about big changes planned for the SAT. The coverage was amplified and augmented by the cover story of the March 9 issue of The New York Times’ Sunday Magazine.
Having spent 21 years as an employee of the Educational Testing Service—the College Board’s principal contractor in the production, administration, and scoring of the SAT—I read these reports with great interest. In the end, I was left wondering why these changes in particular were being proposed and, moreover, why all the hoopla. Let me elaborate. There were three principal changes described:
- In scoring, they would no longer have a penalty for guessing.
- They would reduce the amount of arcane vocabulary used on the test.
- They would discontinue the Writing part of the exam and return to just the canonical Verbal and Quantitative sections, leaving Writing as an optional stand-alone section.
By my reckoning, the first change is likely to have only a small effect, but it probably will increase the error variance of the test, hence making the scores a little less accurate. The second is addressing something that isn’t really much of a problem, but it is always wise to be vigilant about including item characteristics that are unrelated to the trait being measured. I am not sanguine about any changes being implemented successfully. The third is probably a face-saving reaction to the 2005 introduction of the Writing section that did not work as hoped and is the one modification that is likely to yield a positive effect.
To see how I arrived at these conclusions, let’s go over each of the changes more slowly.
No Penalty for Guessing
The current SAT uses what is called “formula scoring.” That is, the score an examinee gets is equal to the number right diminished by one-fourth the number wrong (for five-choice items; for k-choice items, it would be 1/(k-1) of the number wrong). The idea behind this is that if examinees guess completely at random among the choices on the items for which they don’t know the answer, then for every four (or k-1) items they get wrong, they will, on average, get one right by chance. So, under these circumstances, the expected gain from guessing is zero. Thus, there is no benefit to guessing, and guessing adds no bias to the scores, although it does unnecessarily add binomial variance to the score. Note that if an examinee has partial knowledge and can eliminate one or more of the distractors (the wrong answers), the expected gain from guessing, even after correction, is positive, thus giving some credit for such partial knowledge. What is being proposed is to do away with this correction and simply use the unadjusted number right as the input to the scoring algorithm.
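The arithmetic behind formula scoring can be checked with a minimal sketch. It assumes that each right answer earns one point, each wrong answer costs 1/(k-1) of a point, and the examinee guesses uniformly at random among the choices not yet ruled out. The function name `expected_gain_per_guess` and its parameters are illustrative only and are not part of any College Board scoring procedure.

```python
from fractions import Fraction


def expected_gain_per_guess(k: int, eliminated: int = 0) -> Fraction:
    """Expected formula-score change from guessing on one k-choice item.

    `eliminated` is the number of wrong answers (distractors) the examinee
    has ruled out before guessing uniformly among the remaining choices.
    Assumed scoring rule: +1 for a right answer, -1/(k-1) for a wrong one.
    """
    if k < 2 or not 0 <= eliminated <= k - 1:
        raise ValueError("need k >= 2 and 0 <= eliminated <= k - 1")
    remaining = k - eliminated
    p_right = Fraction(1, remaining)   # chance the guess is correct
    penalty = Fraction(1, k - 1)       # points lost for a wrong answer
    return p_right - (1 - p_right) * penalty


# Five-choice items: blind guessing breaks even; partial knowledge pays.
for eliminated in range(4):
    print(eliminated, expected_gain_per_guess(5, eliminated))
```

With five choices and no distractors eliminated the expected gain is exactly zero; ruling out one, two, or three distractors raises it to 1/16, 1/6, and 3/8 of a point per item, which is the "credit for partial knowledge" described above.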
What is likely to be the effect? My memory is that the correlation between formula score and number correct score is very close to one. So whatever change occurs will probably be small. But, maybe not, for it seems plausible that using formula scoring deters at least some random guessing. Such guessing adds no information to the score, just noise, so it is hard to make a coherent argument for why we would want to encourage it. But perhaps it was decided that if the effect of making the change is small, why not do it—perhaps it would make it look like the College Board was being responsive to critics without making any real change.
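To see why blind guessing contributes noise rather than information, here is a rough sketch comparing the two scoring rules for an examinee who guesses at random on a set of items. It assumes independent, uniformly random guesses on k-choice items; the helper `guessing_moments` is hypothetical and only illustrates the binomial arithmetic.

```python
from fractions import Fraction


def guessing_moments(k: int, n_guessed: int, formula_scoring: bool):
    """Mean and variance of the raw points earned by blind guessing.

    Assumes n_guessed independent, uniformly random guesses on k-choice
    items. Under formula scoring a wrong answer costs 1/(k-1) of a point;
    under number-right scoring a wrong answer costs nothing.
    """
    p = Fraction(1, k)                                  # chance a guess is right
    wrong = -Fraction(1, k - 1) if formula_scoring else Fraction(0)
    mean_one = p * 1 + (1 - p) * wrong                  # expected points per item
    var_one = p * 1 + (1 - p) * wrong**2 - mean_one**2  # variance per item
    return n_guessed * mean_one, n_guessed * var_one


# Twenty blind guesses on five-choice items under each scoring rule.
print(guessing_moments(5, 20, formula_scoring=True))
print(guessing_moments(5, 20, formula_scoring=False))
```

Under these assumptions, twenty blind guesses have an expected value of zero points under formula scoring but four points under number-right scoring, so dropping the penalty rewards guessing on every unknown item. In both cases the guesses add variance to the score (5 and 16/5 raw points squared, respectively), whereas an omitted item adds none.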
Reduce the Amount of Arcane Vocabulary
This modification has been the subject of considerable discussion (see Murphy’s December 2013 Atlantic article), but the meaning of the term arcane—within this context—remains shrouded in mystery. Let us begin with the following dictionary definition:
arcane (adjective) — known or understood by very few; mysterious; secret; obscure; esoteric:
She knew a lot about Sanskrit grammar and other arcane matters.
Arcane words, defined in this way, bring to mind obscure words used within very narrow contexts, such as chukka or cwm. The first is one of the 7.5-minute periods that make up a polo match and is rumored to have been last used as part of an SAT more than 60 years ago. The second derives from the Welsh word for valley; its principal use is in the closing plays of Scrabble games.
But this does not seem to characterize what critics of SAT vocabulary have in mind. An even dozen words that have been used to illustrate this “flaw” are, in alphabetical order: artifice, baroque, concomitant, demagogues, despotism, illiberal, meretricious, obsequious, recondite, specious, transform, and unscrupulous.
Words of this character are less often heard in common conversation than they are read. Thus, I think a better term for such words is not arcane, but rather “literary.” Why we would want to rid the SAT of the lexical richness of words accumulated through broad reading is a question that seems hard to justify. I will not try. Instead, let me parse the topic into what I see as its three component parts.
- (i) How much arcane vocabulary is there on the SAT? I suspect that using the true definition of arcane, there is close to none. Using my modified definition of literary vocabulary, there is likely some, but with the promised shift to including more “foundational” documents on the test (e.g., Declaration of Independence, Federalist Papers), it seems unavoidable that certain kinds of literary, if not arcane, vocabulary will show up. In an introductory paragraph of Alexander Hamilton’s Federalist #1 General Introduction (see sidebar), I found a fair number of my illustrative dozen (indicated in boldface).
- (ii) Is supporting the enrichment of language with unusual words necessarily a bad thing? I found Hamilton’s “General Introduction” to be lucid and well argued. Was it so in spite of his vocabulary? Or because of it? James Murphy, in his December 2013 The Atlantic article, “The Case for SAT Words,” argues persuasively in support of enrichment. I tend to agree.
- (iii) But, it may be that there are still some pointlessly arcane words on the SAT that rarely appear anywhere else than on the SAT (akin to such words as “busker,” which has appeared in my writing for the first time today). If such vocabulary is actually on the SAT, how did it find its way there? There are probably many answers to this question, but I believe they all share a common root. Consider a typical item on the Verbal section of the SAT (or any other verbal exam), say a verbal reasoning or verbal analogy item. The difficulty of the item varies with the subtlety of the reasoning or the complexity of the analogy. It is a sad, but inexorable, fact about test construction that item writers cannot write items that are more difficult than they are smart. And so the distribution of item difficulties looks a lot like the distribution of item writer ability. But, to make discriminations among candidates at high levels of ability, test specifications require a fair number of difficult items. How is the item writer going to respond when her supervisor tells her to write 10 hard items? Often, the only way to generate such items is to dig into a thesaurus and insert words that are outside of broad usage. This will yield the result that fewer people will get them right (the very definition of “harder”).
Clearly, the inclusion of such vocabulary is not directly related to the trait being tested (e.g., verbal reasoning), any more than is making a writing task harder by insisting examinees hold the pen between their toes. And so, getting rid of such vocabulary may be a good idea. I applaud it. But how then will difficult verbal items be generated? One possibility is to hire much smarter item writers, but such people are not easy to find, nor are they cheap. Still, the College Board’s plan may work for a while, as long as unemployment among Ivy League English and classics majors is high. As the job market improves, such talent will become rarer. Thus, so long as the need for difficult verbal items remains, I fear we may see the inexorable seepage of a few arcane words back onto the test. But with all the checks and edits a prospective SAT item must negotiate, I don’t expect to see many.
Making the Writing Portion Optional
To discuss this topic fully, we need to review the purposes of a test. There are at least three:
- Test as contest – the higher score wins (gets admitted, gets the job, etc.). For this purpose to be served, the only characteristic a test must have is fairness.
- Test as measuring instrument – the outcome of the test is used for further treatment (determination of placement in courses, the measurement of the success of instruction, etc.). For this purpose, the test score must be accurate enough for the applications envisioned.
- Test as prod – Why are you studying? I have a test. Or, more particularly, why does the instructor insist on students writing essays? Because they will need to write on the test. For this purpose, the test doesn’t even have to be scored, although that practice would not be sustainable.
With these purposes in mind, why was the writing portion added to the core SAT in 2005? I don’t know. I suspect it was for a combination of reasons, but principally the third: as a prod. Why a prod, and not for other purposes? Scoring essays, because of its inherent subjectivity, is a difficult task on which to obtain much uniformity of opinion. More than a century ago, it was found that there was as much variability among scorers of a single essay as there was across all the essays seen by a single scorer. After uncovering this disturbing result, the conclusion reached by the examiners was that scorers needed to be better trained. Almost 25 years ago, a more modern analysis of a California writing test found that the variance component due to raters was the same as that due to examinees. So, 100 years of experience training essay raters didn’t help.
A study done in the mid-1990s used a test made up of three 30-minute sections. Two sections were essays and one was a multiple-choice test of verbal ability. The multiple-choice score correlated more highly with each of the essay scores than the essay scores did with one another. What this means is that if you want to predict how an examinee will do on some future writing task, you can do so more accurately with a multiple-choice test than with a writing test.
Thus, I conclude that the Writing section must have been included primarily as a prod so teachers would emphasize writing as part of their curriculum. Of course, the writing section also included a multiple-choice portion (that was allocated 35 of the 60 minutes for the whole section) to boost the reliability of the scores to something approaching acceptable levels.
The Writing section also has been the target of a fair amount of criticism from teachers of writing who claimed, credibly, at least to me, that allocating 25 minutes yielded no real measure of a student’s ability. This view was supported by the sorts of canned general essays coaching schools had their students memorize. Such essays contained the key elements of a high-scoring essay (400 words long, three quotes from famous people, seven complex words, and the suitable insertion of some of the words in the ‘prompt’ that instigated the essay).
Which brings us to the point at which we can examine what has changed to cause the College Board to reverse field and start its removal. I suspect at least part of the reason is that it was an unrealistic task, expensive to administer and score, that yielded an unreliable measure of little value.
Making it optional, as well as scoring it on a different scale than the rest of the SAT, is perhaps the College Board’s way of getting rid of it gracefully and gradually. Based on the resources planned for the continued development and scoring of this section, it appears the College Board is guessing few colleges will require it and few students will elect to take it.
The SAT has been in existence, in one form or another, since 1926. Its character was not arrived at by whim. There is strong evidence, accumulated over those nine decades, that supports many of the decisions made in its construction. But it is not carved in stone, and changes have occurred continuously. However, those changes were small, inserted with the hope of making an improvement if they work and not being too disastrous if they do not. This follows the best advice of experts in quality control and has served the College Board well. The current changes fall within these same limits. They are likely to make only a small difference but, with luck, the difference will be a positive one. The most likely place for an improvement is the shrinkage of the Writing section. The other two changes appear to be largely cosmetic and not likely to have any profound effect.
Why were they included in the mix? Some insight into this question is provided by recalling a conversation in the late 1960s between John Kemeny and Kingman Brewster, the presidents of Dartmouth and Yale, respectively. Dartmouth had just gone co-ed and successfully avoided the ire of the inevitable alumni who tend to oppose any changes. Yale was about to undergo the same change and so Brewster asked Kemeny if he had any advice. Kemeny replied, “Get rid of the bulldog.”
At the same time that Dartmouth made the enrollment change, they also switched their mascot from the Dartmouth Indian to the Big Green. Alumni apparently were so up in arms about the change in mascot that they hardly noticed the girls. By the time they did, it was a fait accompli (and they then noticed that they could now send their daughters to Dartmouth and were content).
Could it be that the College Board’s announced changes vis-a-vis guessing and arcane vocabulary were merely the bulldog they planned to use to distract attention from the reversal of opinion represented by the diminution of importance of the Writing section? Judging from the reaction in the media to the College Board’s announcement, this seems a plausible conclusion.
Balf, T. 2014. The SAT is hated by—all of the above. New York Times Sunday Magazine, 26-31, 48-49.
Bock, R. D. 1991. The California assessment. A talk given at the Educational Testing Service, Princeton, NJ, on June 17, 1991.
Murphy, J. S. 2013. The case for SAT words. The Atlantic.
Wainer, H. 2011. Uneducated guesses: Using evidence to uncover misguided education policies. Princeton, NJ: Princeton University Press.
Wainer, H., R. Lukhele, and D. Thissen. 1994. On the relative value of multiple-choice, constructed response, and examinee-selected items on two achievement tests. Journal of Educational Measurement 31:234–250.
About the Author
Howard Wainer is currently distinguished research scientist at the National Board of Medical Examiners. He has won numerous awards and is a Fellow of the American Statistical Association and the American Educational Research Association. His interests include the use of graphical methods for data analysis and communication, robust statistical methodology, and the development and application of generalizations of item response theory. He has published more than 20 books; his latest is Medical Illuminations: Using Evidence, Visualization, and Statistical Thinking to Improve Healthcare (Oxford University Press, 2014).
Visual Revelations covers many topics, but generally focuses on two principal themes: graphical display and history. Howard Wainer, column editor, encourages using this column as an outlet for popular statistical discourse. If you have questions or comments about the column, please contact Wainer at firstname.lastname@example.org.