1997
DOI: 10.1111/j.1745-3984.1997.tb00512.x

Evaluating an Automatically Scorable, Open‐Ended Response Type for Measuring Mathematical Reasoning in Computer‐Adaptive Tests

Abstract: The first generation of computer‐based tests depends largely on multiple‐choice items and constructed‐response questions that can be scored through literal matches with a key. This study evaluated scoring accuracy and item functioning for an open‐ended response type where correct answers, posed as mathematical expressions, can take many different surface forms. Items were administered to 1,864 participants in field trials of a new admissions test for quantitatively oriented graduate programs. Results showed au…
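
The scoring engine evaluated in the paper is not described in this excerpt, but the underlying idea, accepting an answer in any algebraically equivalent surface form rather than requiring a literal match with a key, can be sketched in a few lines. The sketch below is an illustration only, not the system the study evaluated; it assumes Python with the third-party SymPy library, and the function name score_expression is hypothetical.

from sympy import simplify, sympify
from sympy.core.sympify import SympifyError

def score_expression(response: str, key: str) -> bool:
    # Score the response as correct if it is mathematically equivalent
    # to the keyed answer, regardless of how it is written.
    try:
        difference = sympify(response) - sympify(key)
    except SympifyError:
        return False  # unparsable input is scored as incorrect
    return simplify(difference) == 0

# Different surface forms of the same answer all score as correct:
print(score_expression("2*(x + 3)", "2*x + 6"))         # True
print(score_expression("(x - 1)*(x + 1)", "x**2 - 1"))  # True
print(score_expression("x + 2", "2*x"))                 # False

Checking that the simplified difference is zero sidesteps enumerating every acceptable surface form of the key, which is the practical obstacle that literal key matching runs into.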

Cited by 45 publications (26 citation statements)
References 7 publications
“…The inherent variability of open-ended solutions, and lack of defined evaluation criteria for design makes automatically assessing open-ended work challenging (Bennett et al 1997). In addition, automated systems frequently cannot capture the semantic meaning of answers, which limits the feedback that they can provide to help students improve (Bennett 1998;Hearst 2000).…”
Section: The Promise of Peer Assessment
Citation type: mentioning · Confidence: 99%
“…Appearing in this decade were ETS's first attempts at automated scoring, including of computer science subroutines (Braun et al 1990), architectural designs (Bejar 1991), mathematical step-by-step solutions and expressions (Bennett et al 1997;Sebrechts et al 1991), short-text responses (Kaplan 1992), and essays (Kaplan et al 1995). By the middle of the decade, the work on scoring architectural designs had been implemented operationally as part of the National Council of Architectural Registration Board's Architect Registration Examination (Bejar and Braun 1999).…”
Section: Constructed-Response Formats and Performance Assessment
Citation type: mentioning · Confidence: 99%
“…The scoring accuracy for constructed-response items is generally lower than for multiple-choice items, but some in mathematics can be scored quite accurately, even compared to multiple-choice. For example, Bennett, Steffen, Singley, Morley, and Jacquemin (1997) found very high accuracy rates for the mathematical expressions (ME) response type when users entered expressions on the computer.…”
Section: Item Type Considerations
Citation type: mentioning · Confidence: 99%