Abstract: E-rater® has been used by the Educational Testing Service for automated essay scoring since 1999. This paper describes a new version of e-rater (V.2) that differs from other automated essay scoring systems in several important respects. The main innovations of e-rater V.2 are a small, intuitive, and meaningful set of scoring features; a single scoring model and standard that can be used across all prompts of an assessment; and modeling procedures that are transparent, flexible, and can be based entirely on expert judgment. The paper describes this new system and presents evidence on the validity and reliability of its scores.
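As a rough illustration of the kind of scoring model the abstract describes (a small, fixed set of features combined transparently), the sketch below builds a standardized, weighted linear composite. The feature names, weights, and score scale are hypothetical, not e-rater's actual values.

```python
import numpy as np

# Hypothetical feature set and weights for illustration; these are not
# e-rater's actual features, weights, or reporting scale.
FEATURES = ["grammar", "usage", "mechanics", "style",
            "organization", "development", "vocabulary", "word_length"]
WEIGHTS = np.full(len(FEATURES), 1.0 / len(FEATURES))  # e.g., equal weights

def score_essay(values, means, stds, scale_min=1.0, scale_max=6.0):
    """Combine standardized feature values into one essay score.

    `values`, `means`, and `stds` are arrays aligned with FEATURES;
    the means and stds would come from a reference sample of essays.
    """
    z = (np.asarray(values) - means) / stds  # standardize each feature
    raw = WEIGHTS @ z                        # weighted linear composite
    # Map the composite onto the reporting scale, clipping to stay in range.
    scaled = scale_min + (raw + 3.0) * (scale_max - scale_min) / 6.0
    return float(np.clip(scaled, scale_min, scale_max))
```

A model of this form can be fit to human scores (the "optimal prediction" variant discussed later in this collection) or fixed entirely by expert judgment, which is what makes the procedure transparent.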
In this article, the authors show that test makers have a strong and systematic tendency to hide correct answers in middle positions, and test takers a corresponding tendency to seek them there. In single, isolated questions, both prefer middle positions to extreme ones by a ratio of up to 3 or 4 to 1. Because test makers routinely, deliberately, and excessively balance the answer key of operational tests, middle bias almost, though not quite, disappears from those keys. Examinees taking real tests also produce answer sequences that are more balanced than their single-question tendencies but less balanced than the correct key. In a typical four-choice test, about 55% of erroneous answers fall in the two central positions. The authors show that this bias is large enough to have real psychometric consequences: questions with middle correct answers are easier and less discriminating than questions with extreme correct answers, and some implications of this fact are explored.
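Position statistics of this kind can be checked mechanically on any response log. The sketch below, assuming hypothetical 0-based answer positions, computes the share of responses falling in the middle options of a four-choice item and tests the observed position counts against uniformity.

```python
from collections import Counter
from scipy.stats import chisquare

def middle_bias(answers, n_choices=4):
    """Share of responses in the middle positions of an n-choice item.

    `answers` is a sequence of chosen positions (0-based). For a
    four-choice item the middle positions are 1 and 2, so an unbiased
    respondent would place 50% of answers there.
    """
    counts = Counter(answers)
    middle = set(range(1, n_choices - 1))
    middle_share = sum(counts[i] for i in middle) / len(answers)
    # Chi-square test of the observed position counts against uniformity.
    observed = [counts[i] for i in range(n_choices)]
    _, p_value = chisquare(observed)
    return middle_share, p_value
```

Applied to erroneous answers only, a middle share near the 55% the authors report, rather than the 50% baseline, would reproduce their finding.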
Educational assessment applications, as well as other natural-language interfaces, need some mechanism for validating user responses. If the input provided to the system is infelicitous or uncooperative, the proper response may be to simply reject it, to route it to a bin for special processing, or to ask the user to modify the input. If problematic user input is instead handled as if it were the system's normal input, this may degrade users' confidence in the software, or suggest ways in which they might try to "game" the system. Our specific task in this domain is the identification of student essays which are "off-topic", or not written to the test question topic. Identification of off-topic essays is of great importance for the commercial essay evaluation system Criterion℠. The methods previously used for this task required 200-300 human-scored essays for training purposes. However, there are situations in which no essays are available for training, such as when users (teachers) spontaneously create a new topic for their students. For such cases, we need a system that works reliably without training data. This paper describes an algorithm that detects when a student's essay is off-topic without requiring a set of topic-specific essays for training. The new system is comparable in performance to previous models that require topic-specific training essays, and it provides more detailed information about the way in which an essay diverges from the requested essay topic.
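The abstract does not spell out the algorithm, but one prompt-comparison approach consistent with its description is to score an essay's similarity to its own prompt against its similarity to a pool of unrelated prompts, flagging essays whose target similarity is anomalously low. The TF-IDF representation and z-score cutoff below are assumptions for illustration, not the paper's exact method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def off_topic_flag(essay, target_prompt, reference_prompts, z_cutoff=-1.0):
    """Flag an essay whose similarity to its own prompt is unusually low
    relative to its similarity to a pool of unrelated prompts."""
    docs = [essay, target_prompt] + list(reference_prompts)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    target_sim, ref_sims = sims[0], sims[1:]
    # Standardize the target similarity against the reference distribution;
    # an essay on topic should stand out from the unrelated prompts.
    z = (target_sim - ref_sims.mean()) / (ref_sims.std() + 1e-9)
    return z < z_cutoff
```

Because the reference prompts stand in for training essays, a scheme like this needs no topic-specific data, which is the property the abstract emphasizes.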
This study examined the construct validity of the e-rater® automated essay scoring engine as an alternative to human scoring in the context of TOEFL® essay writing. Analyses were based on a sample of students who repeated the TOEFL within a short time period. Two e-rater scores were investigated: the first based on optimally predicting the human essay score, and the second based on equal weights for the different e-rater features. Within a multitrait-multimethod approach, the correlations and reliabilities of human and e-rater scores were analyzed together with TOEFL subscores (structure/writing, reading, and listening) and with essay length. Possible biases between human and e-rater scores were examined with respect to differences in performance across countries of origin and differences in difficulty across prompts. Finally, a factor analysis was conducted on the e-rater features to investigate the interpretability of their internal structure and to determine which of the two e-rater scores reflects this structure more closely. Results showed that the e-rater score based on optimally predicting the human score measures essentially the same construct as human essay scores, with significantly higher reliability and consequently higher correlations with related language scores. The equal-weights e-rater score showed the same high reliability but a significantly lower correlation with essay length. It is also aligned with the three-factor hierarchical structure (word use, grammar, and discourse) discovered in the factor analysis. Both e-rater scores also successfully replicated human score differences between countries and prompts.
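A multitrait-multimethod analysis of this kind amounts to inspecting a correlation matrix over repeated scores from both scoring methods. A minimal sketch, assuming a hypothetical repeat-takers file with one row per examinee and illustrative column names:

```python
import pandas as pd

# Hypothetical file: one row per repeat examinee, with scores from both
# test occasions. Column names are illustrative, not the study's data.
df = pd.read_csv("repeat_takers.csv")
scores = ["human_1", "human_2", "erater_1", "erater_2",
          "structure_writing", "reading", "listening", "essay_length"]
mtmm = df[scores].corr()
# Same-trait, different-method entries (human_1 vs. erater_1) speak to
# convergent validity; same-method, cross-occasion entries (human_1 vs.
# human_2) estimate reliability.
print(mtmm.round(2))
```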
Although "hot hands" in basketball are illusory, the belief in them is so robust that it not only has sparked many debates but may also affect the behavior of players and coaches. On the basis of an entire National Basketball Association season's worth of data, the research reported here shows that even a single successful shot suffices to increase a player's likelihood of taking the next team shot, increase the average distance from which this next shot is taken, decrease the probability that this next shot is successful, and decrease the probability that the coach will replace the player.