Abstract: E-rater® has been used by the Educational Testing Service for automated essay scoring since 1999. This paper describes a new version of e-rater (V.2) that differs from other automated essay scoring systems in several important respects. The main innovations of e-rater V.2 are a small, intuitive, and meaningful set of features used for scoring; a single scoring model and set of standards that can be used across all prompts of an assessment; and modeling procedures that are transparent, flexible, and can be based entirely on expert judgment. The paper describes this new system and presents evidence on the validity and reliability of its scores.
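The scoring approach described here, a single model over a small feature set, can be illustrated with a minimal sketch. The feature names, weights, and standardization step below are illustrative assumptions, not ETS's actual model:

```python
# A minimal sketch of the kind of scoring model the abstract describes:
# one weighted combination of a small, interpretable feature set, applied
# across all prompts. Names and weights are hypothetical.

FEATURE_WEIGHTS = {
    "grammar_errors": -0.15,   # error-rate features lower the score
    "usage_errors": -0.10,
    "mechanics_errors": -0.10,
    "organization": 0.30,      # discourse/organization quality
    "development": 0.25,
    "vocabulary": 0.20,
}

def score_essay(features, means, stds):
    """Standardize each feature against reference statistics,
    then return the weighted sum as a raw score."""
    total = 0.0
    for name, weight in FEATURE_WEIGHTS.items():
        z = (features[name] - means[name]) / stds[name]
        total += weight * z
    return total
```

Because the weights are fixed rather than fit per prompt, a model like this could in principle be set entirely by expert judgment, which is the flexibility the abstract highlights.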
Automated essay-scoring technologies can enhance both large-scale assessment and classroom instruction. Essay evaluation software not only numerically rates essays but also analyzes grammar, usage, mechanics, and discourse structure.1,2 In the classroom, such applications can supplement traditional instruction by giving students automated feedback that helps them revise their work and ultimately improve their writing skills. These applications also address educational researchers' interest in individualized instruction; specifically, feedback that refers explicitly to students' own writing is more effective than general feedback.3

Our discourse analysis software, which is embedded in Criterion (www.etstechnologies.com), an online essay evaluation application, uses machine learning to identify discourse elements in student essays. The system makes decisions that exemplify how teachers perform this task. For instance, when grading student essays, teachers comment on the discourse structure: they might explicitly state that the essay lacks a thesis statement or that an essay's single main idea has insufficient support. Training the systems to model this behavior requires human judges to annotate a sample of student essays. The annotation schema reflects the highly structured discourse of genres such as persuasive writing.

Our discourse analysis system uses a voting algorithm that takes into account the discourse labeling decisions of three independent systems (sketched in code below). The three systems employ natural language processing methods to extract essay-based features that help predict the discourse labels, and they use machine learning to classify the sentences in an essay as particular discourse elements. Our tool automatically labels discourse elements in student essays written on any topic and across writing genres.

Essay-based discourse

Researchers have proposed a variety of discourse analysis schemes to capture the semantics of multisentence texts. Some schemes associate a hierarchical representation with a given text, while others use a linear one. The representation used in our work is linear. It assumes that essays can be segmented into sequences of discourse spans and that each span is associated with an overall communicative goal. We focus on essay-specific communicative goals, which we encode using intuitive labels that are frequently used in teaching writing, such as thesis statements, main ideas, and conclusion statements.

Essay annotation protocol

To facilitate development of our discourse analysis systems, two human judges annotated several hundred essays. The judges labeled elements in the essay data according to a protocol that explained how to annotate several discourse categories:

• Title segments indicate essay titles.
• Introductory material segments provide the context or set the stage in which the thesis, a main idea, or the conclusion is to be interpreted.
• Thesis segments state the writer's position statement and are related to the essay prompt.
• Main idea segments assert the author's main message in co...
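As a rough illustration of the voting step described above, the sketch below combines per-sentence labels from three hypothetical classifiers by majority vote. The label set follows the annotation protocol; the tie-breaking rule is an assumption for illustration, not the paper's exact algorithm:

```python
from collections import Counter

# Label set drawn from the annotation protocol above (the source list is
# truncated, so this set is partly assumed).
LABELS = ["Title", "IntroductoryMaterial", "Thesis", "MainIdea",
          "Supporting", "Conclusion"]

def vote(labels_a, labels_b, labels_c):
    """Combine per-sentence discourse labels from three independent
    systems by majority vote; on a three-way tie, defer to system A."""
    final = []
    for a, b, c in zip(labels_a, labels_b, labels_c):
        label, count = Counter([a, b, c]).most_common(1)[0]
        final.append(label if count >= 2 else a)
    return final

# Example: two of three systems agree on "Thesis" for sentence 2.
# vote(["Title", "Thesis"], ["Title", "Thesis"], ["Title", "MainIdea"])
# -> ["Title", "Thesis"]
```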
The e-rater system™1 is an operational automated essay scoring system developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater, is approximately 92%. There is much interest in the larger writing community in examining the system's performance on nonnative speaker essays. This paper focuses on the results of a study of e-rater's performance on Test of Written English (TWE) essay responses written by nonnative English speakers whose native language is Chinese, Arabic, or Spanish. In addition, one small sample of the data is from US-born English speakers, and another is from non-US-born candidates who report that their native language is English. As expected, significant differences were found between the scores of the English groups and those of the nonnative speakers. While there were also differences between e-rater and the human readers for the various language groups, the average agreement rate was as high as operational agreement. At least four of the five features included in e-rater's current operational models (including discourse, topical, and syntactic features) also appear in the TWE models. This suggests that the features generalize well over a wide range of linguistic variation: e-rater was not confounded by nonstandard English syntactic structures or stylistic discourse structures, which one might expect to be a problem for a system designed to evaluate native speaker writing.

1 The e-rater system™ is a trademark of Educational Testing Service. In this paper, we refer to the e-rater system™ as e-rater.
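Agreement figures like the 92% cited above are commonly reported in this literature as exact-plus-adjacent agreement (two scores identical or within one point); a minimal sketch, assuming that convention applies here:

```python
def agreement_rate(scores_a, scores_b, tolerance=1):
    """Fraction of essays on which two sets of holistic scores agree
    within `tolerance` points (tolerance=1 gives exact-plus-adjacent
    agreement; tolerance=0 gives exact agreement)."""
    matches = sum(abs(a - b) <= tolerance
                  for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Example: agreement_rate([4, 5, 3], [4, 4, 5]) -> 2/3
# (|4-4| and |5-4| are within one point; |3-5| is not)
```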
This study examines the relation between essay length and holistic scores assigned to Test of English as a Foreign Language™ (TOEFL®) essays by e‐rater®, the automated essay scoring system developed by ETS. Results show that an early version of the system, e‐rater99, accounted for little variance in human reader scores beyond that which could be predicted by essay length. A later version of the system, e‐rater01, performs significantly better than its predecessor and is less dependent on length due to its greater reliance on measures of topical content and of complexity and diversity of vocabulary. Essay length was also examined as a possible explanation for differences in scores among examinees with native languages of Spanish, Arabic, and Japanese. Human readers and e‐rater01 show the same pattern of differences for these groups, even when effects of length are controlled.
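The study's central question, how much variance in human scores e-rater explains beyond essay length, can be framed as an incremental R² comparison between nested regressions. The sketch below, using ordinary least squares via NumPy, illustrates that framing; it is not the study's exact analysis:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary-least-squares fit of y on X (with intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Placeholder arrays: human reader scores, essay word counts, and
# e-rater scores for the same essays.
# r2_length = r_squared(length[:, None], human)
# r2_both   = r_squared(np.column_stack([length, erater]), human)
# incremental = r2_both - r2_length  # variance explained beyond length
```

Under this framing, an e-rater version that mostly proxies for length yields a small incremental R², while one relying on content and vocabulary measures yields a larger one.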
Educational assessment applications, as well as other natural-language interfaces, need some mechanism for validating user responses. If the input provided to the system is infelicitous or uncooperative, the proper response may be to simply reject it, to route it to a bin for special processing, or to ask the user to modify the input. If problematic user input is instead handled as if it were the system's normal input, this may degrade users' confidence in the software or suggest ways in which they might try to "game" the system. Our specific task in this domain is the identification of student essays that are "off-topic," that is, not written to the test question topic. Identification of off-topic essays is of great importance for the commercial essay evaluation system Criterion℠. The methods previously used for this task required 200-300 human-scored essays for training. However, there are situations in which no essays are available for training, such as when users (teachers) wish to spontaneously write a new topic for their students. For such cases, we need a system that works reliably without training data. This paper describes an algorithm that detects when a student's essay is off-topic without requiring a set of topic-specific essays for training. The new system is comparable in performance to previous models that require topic-specific essays for training, and it provides more detailed information about the way in which an essay diverges from the requested essay topic.
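A minimal sketch of training-free, prompt-based off-topic detection in this spirit: compare the essay's vocabulary to the prompt text itself rather than to topic-specific training essays. The cosine measure over raw word counts and the threshold value are illustrative assumptions, not the paper's algorithm:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def is_off_topic(essay, prompt, threshold=0.1):
    """Flag an essay whose lexical similarity to the prompt text falls
    below a fixed threshold; no topic-specific essays are needed."""
    essay_counts = Counter(essay.lower().split())
    prompt_counts = Counter(prompt.lower().split())
    return cosine(essay_counts, prompt_counts) < threshold
```

A practical system would add stopword removal and term weighting, and the similarity score itself gives the kind of graded information about how far an essay diverges from the prompt that the abstract mentions.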