A common suggestion in the psychometric literature for fixed-length classification tests is to design tests so that they have maximum information at the cut score. Designing tests in this way is believed to maximize the classification accuracy and consistency of the assessment. This article uses simulated examples to illustrate that one can obtain higher classification accuracy and consistency by designing tests that have maximum test information at locations other than the cut score. We show that the location where one should maximize the test information depends on the length of the test, the mean of the ability distribution relative to the cut score, and, to a lesser degree, whether one wants to optimize classification accuracy or consistency. Analyses also suggested that the differences in classification performance between designing tests optimally and maximizing information at the cut score tended to be greatest when tests were short and the mean of the ability distribution was farther from the cut score. Larger differences were also found in the simulated examples that used the 3PL model than in the examples that used the Rasch model.

An important function of many educational and psychological tests is to classify examinees into different performance categories. In K-12 settings, examinees might be classified into multiple performance categories, such as advanced, proficient, partially proficient, and basic, or simply proficient and not proficient. In licensure and certification (credentialing) testing, examinees are classified as having passed or failed an exam. Key considerations in these contexts are the classification accuracy and consistency of the assessment.
Classification accuracy is the extent to which observed classifications agree with "true" classifications, and classification consistency is the proportion of examinees who would be classified into the same performance category over parallel replications of the assessment (Lee, 2010).

Measurement experts have expressed a simple, intuitive, and widely accepted approach to developing fixed-length tests that maximize the classification accuracy and consistency of exam scores when using item response theory (IRT) models. That approach is to maximize test information near the cut score used to classify examinees (