2016
DOI: 10.1111/jedm.12099

Does Maximizing Information at the Cut Score Always Maximize Classification Accuracy and Consistency?

Abstract: A common suggestion made in the psychometric literature for fixed-length classification tests is that one should design tests so that they have maximum information at the cut score. Designing tests in this way is believed to maximize the classification accuracy and consistency of the assessment. This article uses simulated examples to illustrate that one can obtain higher classification accuracy and consistency by designing tests that have maximum test information at locations other than at the cut score. We s…
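The kind of simulation the abstract describes can be sketched roughly as follows. This is an illustrative Monte Carlo under the Rasch model, not the authors' actual code: the cut score, item counts, ability distribution, and sum-score pass rule are all assumptions made for the example.

```python
# Illustrative sketch (assumed design, not the authors' code): compare the
# classification accuracy of a test whose information peaks at the cut score
# with one whose information peaks elsewhere, under the Rasch model.
import numpy as np

rng = np.random.default_rng(0)

def simulate_accuracy(item_difficulties, cut=0.5, n_examinees=20000):
    theta = rng.normal(0.0, 1.0, n_examinees)        # examinee abilities
    b = np.asarray(item_difficulties, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b)))  # Rasch response probabilities
    responses = rng.random(p.shape) < p
    scores = responses.sum(axis=1)
    # Pass an examinee if their raw score is at least the expected score of a
    # borderline examinee (the test characteristic curve evaluated at the cut).
    tcc_at_cut = (1.0 / (1.0 + np.exp(-(cut - b)))).sum()
    decision = scores >= tcc_at_cut
    truth = theta >= cut
    return (decision == truth).mean()

cut = 0.5
centered = np.full(30, cut)        # test information peaks at the cut score
shifted  = np.full(30, cut + 1.0)  # test information peaks above the cut
print(f"info at cut:  accuracy = {simulate_accuracy(centered, cut):.3f}")
print(f"info off cut: accuracy = {simulate_accuracy(shifted,  cut):.3f}")
```

Running variations of this design (different item counts, cut locations, and ability distributions) is one way to see the article's point that the information-at-the-cut test does not always win.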

Cited by 7 publications (8 citation statements). References 18 publications.
“…The findings of this study are aligned with what one would expect based upon previous empirical work. As Wyse and Babcock (2016) demonstrated, group ability and the location of maximum information of test items have an impact on classification accuracy. It would therefore be expected that classification accuracy would be different for subgroups on an examination, depending upon the location of maximum test information and subgroup means.…”
Section: Discussion
confidence: 99%
“…These test lengths were chosen to approximate reliability values of .70, .80, and .90, with .70 indicative of low reliability, .80 indicative of moderate reliability, and .90 indicative of high reliability. Additionally, Wyse and Babcock (2016) found larger differences in classification accuracy when the number of items was at or below 50 items. Reliability was calculated as Rasch person separation reliability (Linacre, ).…”
Section: Methods
confidence: 99%
“…Depending on the purpose of the test, a better item might be the one that maximizes the information at another ability level. For example, in a licensure examination, where the aim is to determine whether examinees are above or below a cut score, test developers might want to administer items that have maximum information at the cut score or somewhere close to the cut score (Wyse & Babcock, 2016). The exact locations of the examinees on the ability scale might not be the primary purpose of the examination.…”
Section: The Optimum Item and Item Pool
confidence: 99%
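The item-selection idea in the quoted passage can be illustrated with the standard 2PL item information function, I(θ) = a²P(θ)(1 − P(θ)): for a pass/fail decision, one candidate rule is to administer the pool item with the most information at the cut score. The item pool below is made up for illustration.

```python
# Hypothetical sketch of selecting the most informative item at a cut score
# under the 2PL model. The pool of (a, b) parameters is invented for the example.
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Toy item pool: (discrimination a, difficulty b) pairs.
pool = [(1.0, -1.0), (1.2, 0.4), (0.8, 0.5), (1.5, 1.5)]
cut = 0.5

best = max(pool, key=lambda ab: item_information(cut, *ab))
print("most informative item at the cut:", best)  # (1.2, 0.4)
```

Note that the winner is not the item whose difficulty sits exactly at the cut: a more discriminating item slightly off target can carry more information there, which is part of why "maximize information at the cut" is a heuristic rather than a guarantee.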