A Simulation Study of Methods for Assessing Differential Item Functioning in Computerized Adaptive Tests

Zwick, Rebecca; Thayer, Dorothy T.; Wingersky, Marilyn S.

doi:10.1177/014662169401800203

Cited by 43 publications

(55 citation statements)

References 15 publications

(17 reference statements)

“…In addition they found that pretest DIF statistics were generally well behaved, but the MH DIF statistics tended to have larger standard errors for the pretest items than for the CAT items. Zwick et al (1994) addressed the effect of using alternative matching methods for pretest items. Using a more elegant matching procedure did not lead to a reduction of the MH standard errors and produced DIF measures that were nearly identical to those from the earlier study.…”

Section: Subsequent Developments With the Mantel-Haenszel (MH) Approach (mentioning)

confidence: 99%

“…Wainer (1993) provided an IRT-based effect size of amount of DIF that is based on the STAND weighting system that allows one to weight difference in the item response functions (IRF) in a manner that is proportional to the density of the ability distribution. Zwick et al (1994) and Zwick et al (1995) applied the Rasch model to data simulated according to the 3PL model. They found that the DIF statistics based on the Rasch model were highly correlated with the DIF values associated with the generated data, but that they tended to be smaller in magnitude.…”

Section: Item Response Theory (IRT) (mentioning)

confidence: 99%

Contributions to the Quantitative Assessment of Item, Test, and Score Fairness

Dorans

2017

Methodology of Educational Measurement and Assessment

“…In addition they found that pretest DIF statistics were generally well behaved, but the MH DIF statistics tended to have larger standard errors for the pretest items than for the CAT items. Zwick, Thayer, and Wingersky (1994) addressed the effect of using alternative matching methods for pretest items. Using a more elegant matching procedure did not lead to a reduction of the MH standard errors and produced DIF measures that were nearly identical to those from the earlier study.…”

Section: Subsequent Developments With the Mantel-Haenszel (MH) Approach (mentioning)

confidence: 99%

“…Wainer (1993) provided an IRT-based effect size of amount of DIF that is based on the STAND weighting system that allows one to weight difference in the item response functions (IRF) in a manner that is proportional to the density of the ability distribution. Zwick et al (1994) and Zwick, Thayer, and Wingersky (1995) Thissen et al (1993). Zwick (1989, 1990) demonstrated that the null definition of DIF for the MH procedure (and hence STAND and other procedures employing observed scores as matching variables) and the null hypothesis based on IRT are different because the latter compares item response curves, which in essence condition on unobserved ability.…”

Section: Item Response Theory (IRT) (mentioning)

confidence: 99%

ETS Contributions to the Quantitative Assessment of Item, Test, and Score Fairness

Dorans

2013

ETS Research Report Series

Quantitative fairness procedures have been developed and modified by ETS staff over the past several decades. ETS has been a leader in fairness assessment, and its efforts are reviewed in this report. The first section deals with differential prediction and differential validity procedures that examine whether test scores predict a criterion, such as performance in college, across different subgroups in a similar manner. The bulk of this report focuses on item level fairness, or differential item functioning, which is addressed in the various subsections of the second section. In the third section, I consider research pertaining to whether tests built to the same set of specifications produce scores that are related in the same way across different gender and ethnic groups. Limitations with the approaches reviewed here are discussed in the final section.

Key words: fairness, differential prediction, differential item functioning, score equity assessment, ETS, quantitative methods

“…The matching items were subsets of the 75 items used in the simulation conducted by Zwick, Thayer, and Wingersky (1994). (The selection of item parameter values for this earlier study was based on analyses of actual test data.)…”

Section: Matching Items (mentioning)

confidence: 99%

Describing and Categorizing DIF in Polytomous Items

Zwick

Thayer

Mazzeo

1997

ETS Research Report Series

Self Cite

The purpose of this project was to evaluate statistical procedures for assessing differential item functioning (DIF) in polytomous items (items with more than two score categories). Three descriptive statistics (the Standardized Mean Difference, or SMD [Dorans & Schmitt, 1991], and two procedures based on SIBTEST [Shealy & Stout, 1993]) were considered, along with five inferential procedures: two based on SMD, two based on SIBTEST, and the Mantel (1963) method. The DIF procedures were evaluated through applications to simulated data, as well as data from ETS tests. The simulation included conditions in which the two groups of examinees had the same ability distribution and conditions in which the group means differed by one standard deviation. When the two groups had the same distribution, the descriptive index that performed best was the SMD. When the two groups had different distributions, a modified form of the SIBTEST DIF effect size measure tended to perform best. The five inferential procedures performed almost indistinguishably when the two groups had identical distributions. When the two groups had different distributions and the studied item was highly discriminating, the SIBTEST procedures showed much better Type I error control than did the SMD and Mantel methods, particularly in short tests. The power ranking of the five procedures was inconsistent; it depended on the direction of DIF and other factors. Routine application of these polytomous DIF methods at ETS seems feasible in cases where a reliable test is available for matching examinees. For the Mantel and SMD methods, Type I error control may be a concern under certain conditions. In the case of SIBTEST, the current version cannot easily accommodate matching tests that do not use number-right scoring. Additional research in these areas is likely to be useful.
