This study explores alternative ways of reducing the number of variables/features, and additional ways of combining information across features, to produce more stable and accurate e-rater ® scores. An explanation of the statistical aspects of the current process is followed by a description of alternatives to it. Our explorations resulted in certain conclusions and directions for future research. We have examined enough e-rater data to conclude that stepwise regression seems to be effective as a feature-reduction procedure. However, this may be attributed to the consistently strong relationship with essay score that is observed for the content vector analysis (CVA) variables and the two variables used to approximate word length (the number of auxiliary verbs and the ratio of the number of auxiliary verbs to the number of words). To yield better validation results, we also suggest that the hold-out method for evaluating validity should replace the current two-stage approach of first developing a model in a quasi-uniform training sample and then validating these results in a target cross-validation sample. More research is needed in several areas. First, explicit modeling of the part of essay scores that is unrelated to word length is warranted. The proportional odds model (POM) approach should be investigated in greater depth. Also needed is a statistical justification for using essay scores to score CVA variables. Algorithmic approaches to prediction/classification problems, such as boosting, may prove fruitful. Further investigation of quantile regression and ridge regression should be conducted.
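The stepwise regression mentioned above can be illustrated with a minimal forward-selection sketch: at each step, the feature that most reduces the residual sum of squares of an ordinary least-squares fit is added. This is a generic illustration of the technique, not e-rater's actual procedure; the data and feature count below are hypothetical.

```python
import numpy as np

def forward_stepwise(X, y, max_features=3):
    """Greedy forward selection: at each step, add the feature that
    most reduces the residual sum of squares of an OLS fit."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(max_features):
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])  # intercept + chosen features
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            rss = float(resid @ resid)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Hypothetical data: y depends only on features 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - 3 * X[:, 2] + 0.1 * rng.normal(size=200)
print(forward_stepwise(X, y, max_features=2))
```

A full stepwise procedure would also consider dropping variables (backward steps) and would use an entry/exit criterion such as an F-test or AIC rather than raw RSS; this sketch keeps only the greedy forward core.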
Traditionally, the fixed‐length linear paper‐and‐pencil (P&P) mode of administration has been the standard method of test delivery. With the advancement of technology, however, adaptive methods such as computerized adaptive testing (CAT) and multistage testing (MST) have grown in popularity in measurement theory and practice. In practice, several standardized tests have sections that include only set‐based items. To date, no study in the literature compares these testing procedures when a test is completely set‐based under various item response theory (IRT) models. This study investigates the measurement precision of MST compared to CAT and to P&P tests for the one‐, two‐, and three‐parameter logistic (1‐, 2‐, and 3PL) models when the test is completely set‐based. Results showed that MST performed better for the 2‐ and 3PL models than an equivalent‐length P&P test in terms of reliability and conditional standard error of measurement. In addition, findings showed that MST performed better for the 1‐ and 2PL models than equivalent‐length CAT. For the 3PL model, MST and CAT performed about the same.
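The conditional standard error of measurement used to compare the designs above can be computed from the test information function. As a minimal sketch under the 2PL model (one of the three models the study considers), item information is a²·P·(1 − P) and SEM(θ) = 1/√I(θ); the item parameters below are hypothetical, not from the study.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response at ability theta,
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def conditional_sem(theta, a, b):
    """SEM(theta) = 1 / sqrt(I(theta)), where for the 2PL the
    item information is a^2 * P * (1 - P), summed over items."""
    P = p_2pl(theta, a, b)
    info = np.sum(a**2 * P * (1 - P))
    return 1.0 / np.sqrt(info)

# Hypothetical 3-item set with difficulties centered near theta = 0.
a = np.array([1.0, 1.2, 0.8])
b = np.array([-0.5, 0.0, 0.5])
print(conditional_sem(0.0, a, b), conditional_sem(3.0, a, b))
```

Because information peaks where item difficulties match examinee ability, the SEM is smallest near θ = 0 for this item set and grows toward the extremes — which is why adaptive designs like MST and CAT, which route examinees toward well-matched items, can achieve lower conditional SEM than a fixed P&P form.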
Observed proportion agreement as a measure of association between two ratings of essay performance can be inflated when the number of rating categories is small. Cohen's Kappa adjusts observed agreement by subtracting out what one might expect if ratings were assigned independently of each other. The matrix of proportion agreements between two sets of assignment rules can be recast as a confusion matrix in which zero confusion is the equivalent of perfect agreement. Kappa can then be viewed as a measure of confusion reduction. A complementary measure, confusion infusion, is defined. Its usefulness is illustrated with live data from a large-scale testing program where e-rater ® , an automatic essay-scoring algorithm, is used in place of a second reader. The confusion reduction and confusion infusion indices help compare the relative efficacy of two versions of e-rater and two other methods of assigning scores: a second reader, and assigning all candidates the mode of the first reading.
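The chance correction described above is κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement (the diagonal mass of the confusion matrix) and p_e is the agreement expected if the two raters assigned scores independently, from the product of their marginal distributions. A minimal sketch, with a hypothetical two-category confusion matrix:

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's Kappa from a square rater-by-rater confusion matrix:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement (diagonal mass) and p_e is the chance agreement
    implied by the two marginal rating distributions."""
    m = np.asarray(confusion, dtype=float)
    total = m.sum()
    p_o = np.trace(m) / total          # observed proportion agreement
    row = m.sum(axis=1) / total        # rater 1 marginal distribution
    col = m.sum(axis=0) / total        # rater 2 marginal distribution
    p_e = float(row @ col)             # expected agreement under independence
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical counts: raters agree on 35 of 50 essays (p_o = 0.70),
# but chance agreement from the marginals is already 0.50.
print(cohens_kappa([[20, 5], [10, 15]]))  # -> 0.4
```

This makes the abstract's point concrete: with few rating categories the marginals alone can produce high observed agreement, so κ credits only the agreement beyond that chance level.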