Three open-ended response types are described that could broaden the conception of mathematical problem solving used in computerized admissions tests: mathematical expression (ME), generating examples (GE), and graphical modeling (GM). ME presents single-best-answer problems that call for an algebraic formalism, the correct rendition of which can take an infinite number of surface forms. GE presents loosely structured problems that can have many good answers taking the form of a value, letter pattern, expression, equation, or list. GM asks the examinee to represent a given situation by plotting points on a grid; these items can have a single best answer or multiple correct answers. For the three basic types, sample items are provided, the examinee interfaces and approaches to automated scoring are described, and research results are reported. It is illustrated how ME, GE, and GM can be combined to form extended constructed-response problems, and a description is offered of how item classes might be used as a basis for creating production-ready scoring keys. Index terms: automated scoring, computer-based testing, constructed response, mathematics performance assessment.

The traditional paper-and-pencil (P&P), multiple-choice (MC) item format consists of a static stimulus followed by a series of response options. This format has served testing programs well for many years because its compactness allows for great breadth of coverage: many items can be administered in a short period. It is also cost efficient because it can be machine scored.

However, this traditional format cannot effectively measure some constructs, in particular when the target construct requires either a dynamic stimulus (e.g., listening comprehension) or a complex response (e.g., writing an essay, composing a computer program, producing a building design). To handle constructs requiring dynamic stimuli, large-scale testing programs have typically combined P&P with video or audio tape, producing a serviceable but expensive and administratively cumbersome assessment. To accommodate constructs calling for complex responses, testing programs have increasingly employed performance tasks, which also can be uneconomical because of the need for human scoring.

Computerized testing has brought with it the potential for "new" assessment tasks. Some of these tasks might be more efficiently delivered versions of tasks used in traditional testing programs; others might be truly new, in that they measure constructs that could not be measured by P&P MC tests.

These new tasks can be divided into three classes. In the first class, the stimulus is dynamic. An operational example is the Listening section of the Test of English as a Foreign Language (Educational Testing Service, 1999). Digitally recorded audio and context-setting photos are presented, followed by MC questions. One advantage of this digital presentation is the consistency in quality with which the same audio stimulus can be delivered from one examinee to the next.
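The ME scoring problem described above, crediting any of the infinitely many surface forms of a correct algebraic answer, comes down to testing symbolic equivalence. The abstract does not name the scoring engine used; the following is only a minimal sketch of the idea using the sympy library, with a hypothetical item key and candidate responses.

```python
# Minimal sketch of ME-style scoring: credit any response that is
# algebraically equivalent to the key, regardless of surface form.
# The item key and responses below are illustrative, not taken from
# the operational tests described in the abstract.
import sympy as sp

def score_me(response: str, key: str) -> bool:
    """Return True if the response is algebraically equivalent to the key."""
    try:
        diff = sp.simplify(sp.sympify(response) - sp.sympify(key))
    except (sp.SympifyError, SyntaxError):
        return False  # an unparseable response earns no credit
    return diff == 0

key = "(x + 1)**2"
# Three different surface forms of the same correct answer:
for resp in ["x**2 + 2*x + 1", "(1 + x)*(x + 1)", "x*(x + 2) + 1"]:
    print(resp, "->", score_me(resp, key))  # True, True, True
print("x**2 + 1 ->", score_me("x**2 + 1", key))  # False
```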
We evaluated a machine-scorable, computer-delivered response type for measuring quantitative reasoning skill. "Generating Examples" (GE) is built around items that present constraints and ask candidates to give one or more answers that meet those constraints. These items are attractive because, like many real-world problems, GE items can have multiple correct answers. In addition, they appear to tap cognitive processes somewhat distinct from those measured by conventional quantitative questions. Nine GE forms were spiraled among a sample of academically precocious youth taking the Computerized SAT in association with a national talent search program. The forms differed in item manipulations designed to affect difficulty and in the expected time per item needed for solution. Results showed that, across item lengths, the insertion of certain constraints increased difficulty. In addition, after correcting for attenuation, GE items similar in time requirements to SAT Mathematical items correlated in the mid-eighties to mid-nineties with SAT Mathematical scores, indicating that GE items might fit reasonably well with the SAT. Key words: constructed response, mathematical reasoning, new item types

Evaluating an Underdetermined Response Type for the Computerized SAT

Large-scale assessment programs have typically measured mathematical skill using multiple-choice items having a single best answer. This format is used because it can be machine scored. However, relying solely on single-best-answer questions arguably narrows construct representation because problems with multiple acceptable answers are found in most criterion situations and may require different solution processes (Frederiksen, 1984).

Generating Examples (GE) problems require the examinee to give responses that meet a specified set of constraints. In contrast to traditional measures, GE problems are typically underdetermined: not enough information is given to identify the answer uniquely. These problems, as a result, can have many correct answers, possibly an infinite number. (A simple instance would be: "Give three numbers for which the mean is at least twice the median.") A framework for understanding how GE and more conventional well-determined problems differ has been proposed by Nhouyvanisvong, Katz, and Singley (1997). Based on cognitive analysis of examinee solution procedures, these investigators concluded that the strategies test candidates use to solve GE problems differ from the formal algebraic methods typically employed to attack well-determined (single-solution) items. In particular, GE problems require the use of a generate-and-test approach because they cannot be solved purely through manipulating algebraic equations. In a generate-and-test approach, the examinee proposes a solution to the problem, then checks whether that solution meets the problem constraints. Generate-and-test (also called "guess-and-check") is frequently observed in solving conventional problems (e.g., Katz, Friedman, Bennett, & Berger, 1996). However, because GE p...
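The sample item quoted above, "Give three numbers for which the mean is at least twice the median," makes both halves of this concrete: the automated scorer only needs to check a response against the constraint, and the examinee's generate-and-test strategy amounts to proposing candidates until one checks out. The sketch below illustrates both under those assumptions; it is not the operational scoring system.

```python
# Sketch of the GE item quoted above. Any response satisfying the
# constraint earns credit; the loop mimics the generate-and-test
# strategy described by Nhouyvanisvong et al. (1997). Implementation
# details here are illustrative assumptions only.
import random
from statistics import mean, median

def satisfies_constraint(nums):
    """Check the item constraint: mean of three numbers >= 2 * median."""
    return len(nums) == 3 and mean(nums) >= 2 * median(nums)

# Automated scoring: check a candidate's response directly.
print(satisfies_constraint([1, 1, 10]))  # True: mean 4.0 >= 2 * median 1
print(satisfies_constraint([2, 3, 4]))   # False: mean 3.0 < 2 * median 3

# Generate-and-test: propose candidates until one meets the constraint.
random.seed(0)
while True:
    candidate = [random.randint(-10, 10) for _ in range(3)]
    if satisfies_constraint(candidate):
        print("found:", candidate)
        break
```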
We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents underdetermined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses examined internal consistency reliability, relations with external measures, features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting that the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions.
This study investigated the psychometric functioning of Graphical Modeling (GM), a new computer-delivered response type for assessing mathematical reasoning that asks candidates to respond to a problem situation by creating a graphical representation. GM problems can be like the single-best-answer items currently found on the General Test, or they can be more loosely defined, allowing for multiple correct responses. Two GM tests differing from one another in the manipulation of specific item features were randomly spiraled among study participants. Analyses were performed relating to internal consistency reliability, relations with external criteria, features that contribute to item difficulty, adverse gender impact, and examinee perceptions. Results showed that GM scores were very reliable and moderately related to the General Test's quantitative section, suggesting that the introduction of GM items on the General Test might help broaden the GRE quantitative construct.

In exploratory analyses of difficulty, one of three manipulated item features, problem structure, had a significant effect. Our impact analyses detected no significant gender differences independent of those associated with the GRE quantitative section. Finally, while more participants preferred regular multiple-choice graphical reasoning questions to GM items, more also thought GM was the fairer indicator of their ability to undertake graduate study.
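The abstract does not detail how a plotted GM response is scored. As a hedged sketch of the general idea, the following assumes a hypothetical single-best-answer item whose model is the line y = 2x + 1 and credits any set of plotted points lying on it, so that several different plots can all be correct, mirroring the multiple-correct-response items the study describes.

```python
# Sketch of one way a GM response (a set of plotted grid points) could
# be scored automatically: credit is given if every plotted point lies
# on the hypothetical target line y = 2x + 1. The tolerance check and
# the item itself are illustrative assumptions, not the GRE's
# operational scoring rules.
from typing import List, Tuple

Point = Tuple[float, float]

def score_gm(points: List[Point], slope: float, intercept: float,
             tol: float = 1e-9) -> bool:
    """True if every plotted point satisfies y = slope * x + intercept."""
    if not points:
        return False  # an empty plot earns no credit
    return all(abs(y - (slope * x + intercept)) <= tol for x, y in points)

# Multiple correct responses: any set of points on the target line
# receives credit, so these two different plots both score as correct.
print(score_gm([(0, 1), (1, 3), (2, 5)], slope=2, intercept=1))  # True
print(score_gm([(-1, -1), (3, 7)], slope=2, intercept=1))        # True
print(score_gm([(0, 1), (1, 4)], slope=2, intercept=1))          # False
```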