The Grabovsky Curve

Visual Revelations

At the beginning of the 20th century, before the widespread use of objective scoring, most well-administered exams consisted of essays scored by multiple expert raters. A common finding was that the variation observed across raters for the same essay was about the same size as the variation observed across examinees for the same question and the same rater. This result was not viewed with satisfaction, and the conclusion drawn was that raters needed better training.

Ninety years later, Chicago’s Darrell Bock (one of the most eminent psychometricians of the second half of the 20th century) replicated the earlier finding in an analysis of a California teachers’ exam, reporting that the variance component for raters equaled that for examinees. This occurred despite the heroic rater-training efforts that had become de rigueur over the intervening years. Such results have been widely replicated and have led to the conclusion that scoring essays reliably is a task of insuperable difficulty. They also explain the enormous popularity of multiple-choice items and other item types that can be scored objectively.
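
To make the finding concrete, here is a minimal simulation sketch (in Python, with all variance values assumed for illustration, not taken from Bock’s analysis) of a fully crossed design in which every rater scores every essay. When the rater variance component is set equal to the examinee component, the usual ANOVA-style estimates recover both, and the implied reliability of a single rating is low.

```python
import numpy as np

rng = np.random.default_rng(0)

n_examinees, n_raters = 200, 10
sigma_examinee = 1.0  # SD of true examinee ability (assumed)
sigma_rater = 1.0     # SD of rater severity -- set equal to the examinee SD,
                      # mirroring the finding described above
sigma_resid = 0.5     # residual rater-by-examinee noise (assumed)

examinee = rng.normal(0, sigma_examinee, n_examinees)
rater = rng.normal(0, sigma_rater, n_raters)
scores = (examinee[:, None] + rater[None, :]
          + rng.normal(0, sigma_resid, (n_examinees, n_raters)))

# Method-of-moments estimates from a two-way crossed design,
# one observation per examinee-rater cell (no interaction term).
grand = scores.mean()
ms_exam = n_raters * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_examinees - 1)
ms_rater = n_examinees * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_raters - 1)
resid = (scores - scores.mean(axis=1, keepdims=True)
         - scores.mean(axis=0, keepdims=True) + grand)
ms_resid = (resid ** 2).sum() / ((n_examinees - 1) * (n_raters - 1))

var_exam = (ms_exam - ms_resid) / n_raters
var_rater = (ms_rater - ms_resid) / n_examinees  # noisy with few raters

# For absolute pass/fail decisions, rater severity counts as error,
# so a single rating's reliability is examinee variance over the total.
reliability = var_exam / (var_exam + var_rater + ms_resid)
print(f"examinee variance ≈ {var_exam:.2f}, rater variance ≈ {var_rater:.2f}")
print(f"reliability of a single rating ≈ {reliability:.2f}")
```

With the assumed values, roughly as much of a score’s variance comes from who graded the essay as from who wrote it, which is exactly what the repeated empirical finding implies.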

Despite the ample psychometric evidence of the unreliability of subjectively scored test items, they continue to be used for many purposes and in many situations today. In recognition of this unreliability, many credible testing organizations (most importantly, the Educational Testing Service) considered a policy of allowing examinees to request a rescoring of their exams, usually for a modest fee. Thus, if a test contains a fair proportion of subjectively scored items and an examinee’s score falls just below the passing score, it may be sensible for that examinee to request a rescoring in the hope that the new result lands above it.
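
To see why such a request can pay off, consider a naive back-of-the-envelope model (all numbers hypothetical): if each scoring adds independent normal error to the same underlying essay quality, the rescore differs from the original score by a normal deviate, and the chance of clearing the cut depends only on the gap to the cut relative to the rater error. This sketch ignores regression to the mean and any policy details of actual rescoring programs.

```python
from math import sqrt
from statistics import NormalDist

def p_pass_on_rescore(observed, cut, rater_sd):
    """Probability that a fresh, independent rescore lands at or above
    the cut score. Assumes each scoring adds independent N(0, rater_sd)
    error to the same underlying essay quality, so the rescore differs
    from the first score by N(0, rater_sd * sqrt(2)). Illustrative only."""
    gap = cut - observed
    return 1 - NormalDist(0, rater_sd * sqrt(2)).cdf(gap)

# A hypothetical examinee 2 points below the cut, with a rater error SD
# of 5 points, clears the cut on rescoring about 39% of the time:
print(f"{p_pass_on_rescore(observed=68, cut=70, rater_sd=5):.2f}")
```

Under these assumed numbers the odds are far from negligible, which is precisely why a modest rescoring fee can look like a good bet to an examinee sitting just under the passing score.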
