The e‐rater® automated scoring engine used at Educational Testing Service (ETS) scores the writing quality of essays. In current practice, e‐rater scores are generated via a multiple linear regression (MLR) model as a linear combination of various features evaluated for each essay, with human scores as the outcome variable. This study evaluates alternative scoring models based on several additional machine learning algorithms, including support vector machines (SVM), random forests (RF), and k‐nearest neighbor (k‐NN) regression. The results suggest that models based on the SVM algorithm outperform MLR models in predicting human scores. Specifically, SVM‐based models yielded the highest agreement between human and e‐rater scores. Furthermore, compared with MLR, SVM‐based models improved the agreement between human and e‐rater scores at the ends of the score scale. In addition, the high correlation of SVM‐based e‐rater scores with external measures, such as examinees' scores on other parts of the test, provides some validity evidence for SVM‐based e‐rater scores. Future research is encouraged to explore the generalizability of these findings.
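To make the modeling comparison concrete, here is a minimal sketch (Python with scikit-learn) of fitting an MLR model and an RBF-kernel SVM regressor to essay features and comparing human-machine agreement with quadratic weighted kappa, a common agreement statistic for this setting. The features, score range, and hyperparameters are hypothetical placeholders, not the operational e-rater configuration.

```python
# Sketch: compare MLR and SVM regression for predicting human essay scores.
# All data below are simulated stand-ins for pre-extracted essay features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                                   # 8 essay features (placeholder)
y = np.clip(np.round(X[:, :3].sum(axis=1) + rng.normal(size=500) + 3), 1, 6)  # human scores on a 1-6 scale

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "MLR": LinearRegression(),
    "SVM": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = np.clip(np.round(model.predict(X_te)), 1, 6)
    # Quadratic weighted kappa between human scores and rounded machine scores
    qwk = cohen_kappa_score(y_te.astype(int), pred.astype(int), weights="quadratic")
    print(f"{name}: QWK = {qwk:.3f}")
```

In practice the comparison would also examine agreement at the extremes of the score scale, since that is where the abstract reports the largest SVM gains.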
The validity of studies investigating interventions to enhance fluid intelligence (Gf) depends on the adequacy of the Gf measures administered. Such studies have yielded mixed results, with a suggestion that Gf measurement issues may be partly responsible. The purpose of this study was to develop a Gf test battery comprising tests meeting the following criteria: (a) strong construct validity evidence, based on prior research; (b) reliable and sensitive to change; (c) varying in item types and content; (d) producing parallel tests, so that pretest-posttest comparisons could be made; (e) appropriate time limits; (f) unidimensional, to facilitate interpretation; and (g) appropriate in difficulty for a high-ability population, to detect change. A battery comprising letter, number, and figure series and figural matrix item types was developed and evaluated in three large-N studies (N = 3,067, 2,511, and 801, respectively). Items were generated algorithmically on the basis of proven item models from the literature, to achieve high reliability at the targeted difficulty levels. An item response theory approach was used to calibrate the items in the first two studies and to establish conditional reliability targets for the tests and the battery. On the basis of those calibrations, fixed parallel forms were assembled for the third study, using linear programming methods. Analyses showed that the tests and test battery achieved the proposed criteria. We suggest that the battery as constructed is a promising tool for measuring the effectiveness of cognitive enhancement interventions, and that its algorithmic item construction enables tailoring the battery to different difficulty targets, for even wider applications.

Keywords: Intelligence. Fluid ability. Gf. Working memory training. Reasoning. Item-response theory. Test assembly

General fluid ability (Gf) is "at the core of what is normally meant by intelligence" (Carroll, 1993, p. 196), and has been shown empirically to be synonymous with general cognitive ability (g), at least within groups with roughly comparable opportunities to learn (Valentin Kvist & Gustafsson, 2008). Gf has been viewed as an essential determinant of one's ability to solve a wide range of novel real-world problems (Schneider & McGrew, 2012). Perhaps because of its association with diverse outcomes, there has been a longstanding interest in improving Gf (i.e., intelligence) through general schooling
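Returning to the assembly step described in the abstract above: the sketch below uses a small mixed-integer linear program (via the PuLP library, as an assumed stand-in for the authors' linear programming setup) to select a fixed form whose summed item difficulty matches a target. The item pool, parameters, and single difficulty constraint are deliberate simplifications of the IRT-based conditional reliability targets described in the abstract.

```python
# Sketch: assemble one fixed form from a calibrated pool by linear programming.
# Item difficulties and the single target constraint are hypothetical.
import random
import pulp

random.seed(1)
n_items, form_len = 60, 20
difficulty = [random.gauss(0.8, 0.6) for _ in range(n_items)]   # IRT b-parameters (placeholder)
target_sum_b = 1.0 * form_len                                    # difficulty target for a high-ability group

x = [pulp.LpVariable(f"item_{i}", cat="Binary") for i in range(n_items)]
dev = pulp.LpVariable("dev", lowBound=0)                         # absolute deviation from the target

prob = pulp.LpProblem("form_assembly", pulp.LpMinimize)
prob += 1.0 * dev                                                # objective: match the difficulty target
prob += pulp.lpSum(x) == form_len                                # fixed form length
sum_b = pulp.lpSum(difficulty[i] * x[i] for i in range(n_items))
prob += sum_b - target_sum_b <= dev
prob += target_sum_b - sum_b <= dev
prob.solve(pulp.PULP_CBC_CMD(msg=False))

chosen = [i for i in range(n_items) if x[i].value() > 0.5]
print("selected items:", chosen)
```

Building parallel forms would add a second set of binary variables and constraints tying the two forms to the same targets, following the same pattern.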
The tasks of automatically scoring either textual or algebraic responses to mathematical questions have both been well studied, albeit separately. In this paper we propose a method for automatically scoring responses that contain both text and algebraic expressions. Our method not only achieves high agreement with human raters, but also links explicitly to the scoring rubric, essentially providing explainable models and a way to potentially provide feedback to students in the future.
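A minimal sketch of the general idea, under assumed tools and data rather than the authors' actual model: sympy checks whether the algebraic part of a response is equivalent to an expression named in the rubric, a simple bag-of-words representation handles the textual part, and both feed one score predictor. The responses, rubric expression, and classifier choice are illustrative placeholders.

```python
# Sketch: jointly score responses that mix free text and an algebraic expression.
import numpy as np
import sympy as sp
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def algebra_matches(student_expr: str, target_expr: str) -> int:
    """Return 1 if the two expressions simplify to the same thing, else 0."""
    try:
        diff = sp.simplify(sp.sympify(student_expr) - sp.sympify(target_expr))
        return int(diff == 0)
    except (sp.SympifyError, TypeError):
        return 0

# Hypothetical training data: (text part, algebra part, human score)
responses = [
    ("the slope is rise over run", "2*x + 3", 2),
    ("I added the numbers", "x + 5", 0),
    ("the slope is 2 so y equals 2x plus 3", "3 + x*2", 2),
    ("not sure", "x", 0),
]
target = "2*x + 3"   # expression named in the scoring rubric (placeholder)

texts = [r[0] for r in responses]
alg_flags = np.array([[algebra_matches(r[1], target)] for r in responses])
y = [r[2] for r in responses]

vec = TfidfVectorizer()
X = hstack([vec.fit_transform(texts), csr_matrix(alg_flags)])   # text features + rubric-linked algebra flag
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```

Keeping the algebra check as an explicit, rubric-linked feature is what makes this kind of model inspectable: a scored response can be traced back to whether its expression matched the rubric target and which text features contributed.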
Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Report series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.

The m-rater scoring engine has been used successfully for the past several years to score CBAL™ mathematics tasks, for the most part without the need for human scoring. During this time, various improvements to m-rater and its scoring keys have been implemented in response to specific CBAL needs. In 2012, with the general move toward creating innovative tasks for the Common Core assessment initiatives, for traditional testing programs, and for potential outside clients, and to further support CBAL, m-rater was enhanced in ways that move ETS's automated scoring capabilities forward and that provide needed functionality for CBAL: (a) the numeric equivalence scoring engine was augmented with an open-source computer algebra system; (b) a design flaw in the graph editor, affecting the way the editor graphs smooth functions, was corrected; (c) the graph editor was modified to give assessment specialists the option of requiring examinees to set the viewing window; and (d) m-rater advisories were implemented for situations in which m-rater either cannot score a response or may provide the wrong score. In addition, two m-rater scoring models were built that presented some new challenges.
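For illustration, the sketch below shows one way a computer algebra system can back up numeric equivalence scoring: try a symbolic simplification first, then fall back to random numeric sampling when simplification is inconclusive. sympy stands in for the open-source system mentioned above, and the function is a hypothetical simplification, not the m-rater implementation.

```python
# Sketch: decide whether an examinee's expression is equivalent to the key.
import random
import sympy as sp

def equivalent(response: str, key: str, trials: int = 20, tol: float = 1e-9) -> bool:
    r, k = sp.sympify(response), sp.sympify(key)
    # First try a purely symbolic check via simplification.
    if sp.simplify(r - k) == 0:
        return True
    # Fall back to numeric sampling over the shared free symbols.
    symbols = sorted(r.free_symbols | k.free_symbols, key=str)
    for _ in range(trials):
        subs = {s: random.uniform(-10, 10) for s in symbols}
        try:
            if abs(complex(r.evalf(subs=subs)) - complex(k.evalf(subs=subs))) > tol:
                return False
        except (TypeError, ZeroDivisionError):
            continue   # skip samples where evaluation fails (e.g., division by zero)
    return True

print(equivalent("(x + 1)**2", "x**2 + 2*x + 1"))   # expected: True
print(equivalent("2*x", "x + 1"))                    # expected: False
```

An operational engine would also need the advisory behavior described above, flagging responses it cannot parse or evaluate rather than guessing at a score.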