Test scores are commonly reported in a small number of ordered categories. Examples of such reporting include state accountability testing, Advanced Placement tests, and English proficiency tests. This paper introduces and evaluates methods for estimating achievement gaps on a familiar standard-deviation-unit metric using data from these ordered categories alone. These methods hold two practical advantages over alternative achievement gap metrics. First, they require only categorical proficiency data, which are often available where means and standard deviations are not. Second, they result in gap estimates that are invariant to score scale transformations, providing a stronger basis for achievement gap comparisons over time and across jurisdictions. We find three candidate estimation methods that recover full-distribution gap estimates well when only censored data are available.

Researchers selecting an achievement gap metric face three issues. First, average-based gaps (effect sizes or simple differences in averages) are variable under plausible transformations of the test score scale (Ho, 2007; Reardon, 2008a; Seltzer, Frank, & Bryk, 1994; Spencer, 1983). Second, gaps based on percentages above a cut score, such as differences in "proficiency" or passing rates, vary substantially under alternative cut scores (Ho, 2008; Holland, 2002). Third, researchers often face a practical challenge: Although they may wish to use an average-based gap metric, the necessary data may be unavailable. This last situation has become common even as the reporting requirements of the No Child Left Behind Act (NCLB) have led to large amounts of easily accessible test score data. The emphasis of NCLB on measuring proficiency rates over average achievement has led states and districts to report "censored data": test score results in terms of categorical achievement levels, typically given labels like "below basic," "basic," "proficient," and "advanced."
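The cut-score sensitivity noted above can be illustrated with a small sketch. Assuming two hypothetical normal score distributions separated by a fixed half-standard-deviation gap (an illustration constructed for this point, not an analysis from the paper), the percent-above-cut gap shifts with the placement of the cut score even though the underlying gap is constant:

```python
from statistics import NormalDist

# Two hypothetical normal score distributions (illustrative values only):
# the groups differ by a fixed 0.5 standard-deviation gap.
group_a = NormalDist(mu=0.5, sigma=1.0)  # typically higher-scoring reference group
group_b = NormalDist(mu=0.0, sigma=1.0)  # typically lower-scoring focal group

def proficiency_rate_gap(cut):
    """Difference in percent-above-cut ("proficiency") rates at a given cut score."""
    rate_a = 1 - group_a.cdf(cut)
    rate_b = 1 - group_b.cdf(cut)
    return rate_a - rate_b

# The same underlying 0.5 SD gap yields different proficiency-rate gaps
# depending on where the cut score is placed.
for cut in (-0.5, 0.25, 1.5):
    print(f"cut = {cut:+.2f}: proficiency-rate gap = {proficiency_rate_gap(cut):.3f}")
```

A low cut (most students above it) and a high cut (few students above it) both compress the rate gap, while a cut near the middle of the distributions maximizes it, which is why cut-score-based gaps are hard to compare across tests and years.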
Traditional Achievement Gap Measures and Their Shortcomings

A test score gap is a statistic describing the difference between two distributions. Typically, the target of inference is the difference between central tendencies. Three "traditional" gap metrics dominate this practice of gap reporting. The first is the test score scale, where gaps are most often expressed as a difference in group averages. For a student test score, X, a typically higher scoring reference group, a, and a typically lower scoring focal group, b, the difference in averages, d_av, follows:

d_av = X̄_a − X̄_b

The second traditional metric expresses the gap in terms of standard deviation units. This metric allows for standardized interpretations when the test score scale is unfamiliar and affords aggregation and comparison across tests with differing score scales (Hedges & Olkin, 1985). Sometimes described as Cohen's d, this effect size expresses d_av in terms of a quadratic average of both groups' standard deviations, s_a and s_b:

d = (X̄_a − X̄_b) / sqrt((s_a² + s_b²) / 2)

Although a weighted average of variances or a single standard deviation could also be used in the denomina...
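The two metrics above can be sketched concretely in a few lines of Python. The sample scores here are made up for illustration, not drawn from the paper:

```python
import math
import statistics

# Hypothetical sample scores for the two groups (illustrative values only).
scores_a = [520, 540, 560, 580, 600]  # reference group a
scores_b = [480, 500, 510, 530, 550]  # focal group b

def average_gap(a, b):
    """d_av: the simple difference in group averages, on the score scale."""
    return statistics.mean(a) - statistics.mean(b)

def cohens_d(a, b):
    """Effect size: d_av divided by the quadratic average of the two
    groups' standard deviations, sqrt((s_a^2 + s_b^2) / 2)."""
    quadratic_sd = math.sqrt((statistics.stdev(a) ** 2 + statistics.stdev(b) ** 2) / 2)
    return average_gap(a, b) / quadratic_sd

print(f"d_av     = {average_gap(scores_a, scores_b):.1f} scale-score points")
print(f"Cohen's d = {cohens_d(scores_a, scores_b):.2f} SD units")
```

Unlike d_av, the standard-deviation-unit gap is comparable across tests with different score scales, though (as the surrounding text notes) both remain sensitive to transformations of the scale itself.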