Over the last fifteen years, many states have implemented high-stakes tests as part of an effort to strengthen accountability for schools, teachers, and students. Predictably, there has been vigorous disagreement regarding the contributions of such policies to increasing test scores and, more importantly, to improving student learning. A recent study by Amrein and Berliner (2002a) has received a great deal of media attention. Employing various databases covering the period 1990-2000, the authors conclude that there is no evidence that states that implemented high-stakes tests demonstrated improved student achievement on various external measures such as performance on the SAT, ACT, AP, or NAEP. In a subsequent study in which they conducted a more extensive analysis of state policies (Amrein & Berliner, 2002b), they reach a similar conclusion. However, both their methodology and their findings have been challenged by a number of authors. In this article, we undertake an extended reanalysis of one component of Amrein and Berliner (2002a). We focus on the performance of states, over the period 1992 to 2000, on the NAEP mathematics assessments for grades 4 and 8. In particular, we compare the performance of the high-stakes testing states, as designated by Amrein and Berliner, with the performance of the remaining states (conditioning, of course, on a state's participation in the relevant NAEP assessments). For each grade, when we examine the relative gains of states over the period, we find that the comparisons strongly favor the high-stakes testing states. Moreover, the results cannot be accounted for by differences between the two groups of states with respect to changes in the percentage of students excluded from NAEP over the same period. On the other hand, when we follow a particular cohort (grade 4, 1992 to grade 8, 1996 or grade 4, 1996 to grade 8, 2000), we find that the comparisons slightly favor the low-stakes testing states, although the discrepancy can be partially accounted for by changes in the sets of states contributing to each comparison. In addition, we conduct a number of ancillary analyses to establish the robustness of our results, while acknowledging the tentative nature of any conclusions drawn from highly aggregated, observational data.
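As a rough illustration of the kind of comparison involved, the sketch below computes mean score gains for two groups of states and the difference between them. All state labels and scale scores are fabricated for illustration; the actual analysis works with the published NAEP state results and conditions on participation in the relevant assessments.

```python
# Hypothetical sketch: comparing mean NAEP scale-score gains for two groups
# of states (high-stakes vs. comparison). All values below are invented.
import numpy as np

# {state: (score_1992, score_2000)} -- fabricated scale scores
high_stakes = {"A": (217.0, 226.0), "B": (213.0, 224.0), "C": (219.0, 227.0)}
comparison  = {"D": (220.0, 225.0), "E": (215.0, 221.0), "F": (222.0, 226.0)}

def mean_gain(states):
    """Average gain from the earlier to the later assessment year."""
    gains = [later - earlier for earlier, later in states.values()]
    return float(np.mean(gains))

diff = mean_gain(high_stakes) - mean_gain(comparison)
print(f"mean gain, high-stakes states: {mean_gain(high_stakes):.1f}")
print(f"mean gain, comparison states:  {mean_gain(comparison):.1f}")
print(f"difference in mean gains: {diff:+.1f} scale-score points")
```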
Scoring reliability of essays and other free-response questions is of considerable concern, especially in large, national administrations. This report describes a statistically designed experiment, carried out in an operational setting, to determine the contributions of different sources of variation to the unreliability of scoring. The experiment made novel use of partially balanced incomplete block designs, which facilitated the unbiased estimation of certain main effects without requiring readers to assess the same paper several times. In addition, estimates were obtained of the improvement in reliability that results from removing the variability due to systematic sources of variation by an appropriate adjustment of the raw scores. This statistical calibration appears to be a cost-effective approach to enhancing scoring reliability when compared to simply increasing the number of readings per paper. The results of the experiment also provide a framework for examining other, simpler calibration strategies; one such strategy is briefly considered.

The inclusion of free-response questions in large-volume examinations has proven to be a mixed blessing. On the one hand, such questions address knowledge or skills that may not be easily or plausibly assessed by multiple-choice questions; the use of essay questions to measure writing skills is a good example. The difficulty arises in the scoring of such questions: large numbers of graders must be trained and supervised, and the maintenance of uniform standards across graders and over many days often becomes problematic.
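To make the calibration idea concrete, here is a minimal sketch assuming fabricated papers, readers, and raw scores: each reader's systematic severity or leniency is estimated as a deviation from the grand mean and subtracted from that reader's raw scores. This crude estimator confounds reader effects with the difficulty of the papers a reader happens to see, which is precisely the confounding the partially balanced incomplete block design is used to avoid.

```python
# Hypothetical sketch of score calibration: estimate reader "main effects"
# (systematic leniency or severity) and subtract them from raw scores.
# Papers, readers, and scores are invented for illustration.
import numpy as np

# (paper_id, reader_id, raw_score) -- fabricated readings
readings = [(0, 0, 4.0), (0, 1, 5.0), (1, 1, 3.0), (1, 2, 4.0),
            (2, 0, 6.0), (2, 2, 5.0), (3, 0, 3.0), (3, 1, 4.0)]

grand_mean = np.mean([s for _, _, s in readings])

# crude reader effect: mean deviation of each reader's scores from the grand mean
readers = sorted({r for _, r, _ in readings})
effect = {r: np.mean([s for _, rr, s in readings if rr == r]) - grand_mean
          for r in readers}

# calibrated score: raw score minus the estimated reader effect
for paper, reader, score in readings:
    print(f"paper {paper}, reader {reader}: "
          f"adjusted score {score - effect[reader]:.2f}")
```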
A longstanding issue in American education is the gap in academic achievement between majority and minority students. The goal of this study is to accumulate and evaluate evidence on the relationship between state education policies and changes in the Black-White achievement gap, while addressing some of the methodological issues that have led to differences in interpretations of earlier findings. To that end, we consider the experiences of ten states that together enroll more than forty percent of the nation's Black students. We estimate the trajectories of Black student and White student achievement on the NAEP 8th grade mathematics assessment over the period 1992 to 2000, and examine the achievement gap at three levels of aggregation: the state as a whole, groups of schools (strata) within a state defined by the SES level of the student population, and within schools within a stratum within a state. From 1992 to 2000, at every level of aggregation, mean achievement rose for both Black students and White students. However, for most states the achievement gaps were large and changed very little at every level of aggregation. The gaps are pervasive, profound, and persistent.

There is substantial heterogeneity among states in the types of policies they pursued, as well as in the coherence and consistency of those policies during the period 1988-1998. We find that states' overall policy rankings (based on our review of the data) correlate moderately with their record in improving Black student achievement but are somewhat less useful in predicting their record with respect to reducing the achievement gaps. States' rankings on commitment to teacher quality correlate almost as well as the overall policy rankings do. Thus, state reform efforts are a blunt tool, but a tool nonetheless.

Our findings are consistent with the following recommendations: states' reform efforts should be built on broad-based support and buffered as much as possible from changes in budgets and politics; states should employ the full set of policy levers at their disposal; and policies should directly support local reform efforts with proven effectiveness in addressing the experiences of students of different races attending the same schools.
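As a schematic of the gap computation at the three levels of aggregation, the sketch below uses a handful of fabricated student records; a real NAEP analysis would incorporate sampling weights and plausible values, which are omitted here for simplicity.

```python
# Hypothetical sketch: Black-White mean-score gap computed at three levels
# of aggregation (state, SES stratum, school). All records are invented.
from collections import defaultdict
import numpy as np

# (stratum, school, group, score) -- fabricated student records
records = [("low", "s1", "Black", 255), ("low", "s1", "White", 280),
           ("low", "s2", "Black", 260), ("low", "s2", "White", 285),
           ("high", "s3", "Black", 270), ("high", "s3", "White", 295)]

def gap(rows):
    """White mean minus Black mean for the given set of records."""
    by_group = defaultdict(list)
    for _, _, group, score in rows:
        by_group[group].append(score)
    return np.mean(by_group["White"]) - np.mean(by_group["Black"])

print(f"state-level gap: {gap(records):.1f}")
for stratum in sorted({r[0] for r in records}):
    rows = [r for r in records if r[0] == stratum]
    print(f"SES stratum {stratum!r}: gap {gap(rows):.1f}")
for school in sorted({r[1] for r in records}):
    rows = [r for r in records if r[1] == school]
    print(f"school {school}: gap {gap(rows):.1f}")
```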
Graduate education in the United States is characterized by an enormous diversity of disciplines and the predominance of relatively small enrollments in individual departments. In this setting, a validity study based on a single department's data and employing classical statistical methods can be of only limited utility and applicability. In order to participate in the Graduate Record Examinations Validity Study Service, a department must have at least 25 students in its entering class. Only validities for single predictors are provided; estimates of the validity of two or more predictors, used jointly, are considered too unreliable because the corresponding prediction equations often possess implausible characteristics, such as negative coefficients. These constraints were introduced by the Validity Study Service to reduce the chance that the results in the report to a department would be overly influenced by statistical artifacts in the data and hence serve more to mislead than to inform. In this study, we investigated two statistical methods, empirical Bayes estimation and cluster analysis, to determine whether their application to the problems faced by the Validity Study Service could yield useful improvements. Considerable effort was expended in developing and studying a new and more general class of empirical Bayes models that can accommodate the complex structure of the Validity Study Service database. The principal methodological conclusions of this study are that, through the use of this new class of empirical Bayes methods, it is possible to obtain, at the departmental level, useful and reliable estimates of the joint validity of several predictors of academic performance, and that these methods may be further refined to address the question of differential predictive validity, again at the departmental level. These results have important practical implications for the GRE Validity Study Service.
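Here is a minimal sketch of the shrinkage idea at the heart of empirical Bayes estimation, using fabricated departmental data: each department's least-squares prediction equation is pulled toward a pooled, across-department equation, with small departments pulled the hardest. The shrinkage weight below is an arbitrary placeholder; the models developed in the report estimate the degree of shrinkage from the data itself.

```python
# Hypothetical sketch: empirical-Bayes-style shrinkage of departmental
# regression coefficients toward a pooled estimate. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    """Ordinary least-squares coefficients via numpy's solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# fabricated departments: intercept plus two predictors (e.g., GRE scores)
# predicting a first-year average, with small per-department sample sizes
departments = []
for n in (12, 18, 25):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([3.0, 0.2, 0.3]) + rng.normal(scale=0.5, size=n)
    departments.append((X, y))

# pooled equation estimated from all departments combined
pooled = ols(np.vstack([X for X, _ in departments]),
             np.concatenate([y for _, y in departments]))

for i, (X, y) in enumerate(departments):
    b = ols(X, y)
    w = len(y) / (len(y) + 20.0)     # placeholder weight: larger n, less shrinkage
    b_eb = w * b + (1 - w) * pooled  # shrink toward the pooled equation
    print(f"department {i} (n={len(y)}): EB coefficients {np.round(b_eb, 2)}")
```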