This study compared the accuracies of four differential item functioning (DIF) estimation methods, where each method makes use of only one of the following: raw data, logistic regression, loglinear models, or kernel smoothing. The major focus was on the estimation strategies' potential for estimating score-level, conditional DIF. A secondary focus was on assessing the accuracy of strategies' overall DIF effect sizes and statistical significance tests. A real data simulation was used to evaluate the estimation strategies with 6 items representing DIF and No DIF situations, and with 4 sample size combinations for the reference and focal group data. Results showed that the logistic regression estimation strategy was the most highly recommended strategy in terms of the bias and variability of its estimates and the power of its statistical significance test. The loglinear models strategy had flexibility advantages, but these advantages only offset the greater variability of its estimates and its reduced statistical power when sample sizes were large. The kernel smoothing estimation strategy was the least accurate of the considered strategies due to estimation problems when the reference and focal groups differed in overall ability.

Key words: DIF, kernel smoothing, loglinear models, logistic regression

While the psychometric literature has defined differential item functioning (DIF) as a performance difference between examinee groups at one level of ability (Dorans & Holland, 1993; Lord, 1980; Shepard, 1982), considerable research has focused on developing and comparing DIF detection methods that summarize DIF across a total range of ability (Dorans & Kulick, 1986; Holland & Thayer, 1988; Kristjansson, Aylesworth, McDowell, & Zumbo, 2005; Roussos & Stout, 1996; Shealy & Stout, 1993; Swaminathan & Rogers, 1990; Zumbo, 1999; Zwick, Thayer, & Lewis, 2000).
This work usually focuses on overall statistical significance tests of summary DIF indexes and, to a lesser extent, on the use of summary DIF indexes as overall effect sizes. Because any summary measure can oversummarize in special circumstances (to be described below), effect sizes and significance tests of overall DIF may benefit from being supplemented with assessments of conditional, ability-level DIF. The purpose of this study was to compare the accuracies of four DIF estimation strategies (raw data, logistic regression, loglinear models, and kernel smoothing) for estimating conditional DIF.
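As a rough illustration of the distinction between conditional and summary DIF, the following Python sketch computes raw-data conditional DIF as the difference in proportion correct between focal and reference examinees at each score level, and then collapses those differences into a single focal-weighted average. The function names, the focal-minus-reference sign convention, the focal-group weighting, and the toy data are all assumptions made for illustration; they are not taken from the study's procedures or results.

```python
from collections import defaultdict


def conditional_dif(scores, responses, groups):
    """Raw-data conditional DIF (hypothetical helper).

    Returns {score: (p_focal - p_reference, n_focal)} for every score
    level at which both groups are observed. The focal-minus-reference
    sign convention is an assumption of this sketch.
    """
    # counts[score][group] = [number correct, number of examinees]
    counts = defaultdict(lambda: {"R": [0, 0], "F": [0, 0]})
    for x, y, g in zip(scores, responses, groups):
        counts[x][g][0] += y
        counts[x][g][1] += 1

    result = {}
    for x in sorted(counts):
        r_correct, r_n = counts[x]["R"]
        f_correct, f_n = counts[x]["F"]
        if r_n and f_n:  # skip levels where one group is missing
            result[x] = (f_correct / f_n - r_correct / r_n, f_n)
    return result


def summary_dif(cond):
    """Focal-weighted average of the conditional differences."""
    total = sum(n for _, n in cond.values())
    return sum(d * n for d, n in cond.values()) / total


def make_rows(score, group, n_correct, n_total):
    """Build toy (score, response, group) records; purely illustrative."""
    return [(score, 1 if i < n_correct else 0, group) for i in range(n_total)]


# Toy data: the focal group does better at score 1 and worse at score 2.
rows = (
    make_rows(1, "R", 4, 10) + make_rows(1, "F", 6, 10)
    + make_rows(2, "R", 8, 10) + make_rows(2, "F", 6, 10)
)
scores, responses, groups = zip(*rows)

cond = conditional_dif(scores, responses, groups)
overall = summary_dif(cond)

for level, (diff, n_focal) in cond.items():
    print(f"score {level}: conditional DIF = {diff:+.2f} (focal n = {n_focal})")
print(f"focal-weighted summary DIF = {overall:+.2f}")
```

In this toy case the conditional differences are roughly +0.20 and -0.20, so they cancel in the weighted average. This is one way a summary index could report essentially no DIF while score-level DIF is present in both directions.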
Assessing Differential Item Functioning (DIF)

The assessment of DIF is a determination of whether a studied item, $Y$, performs differently for reference examinees, $R$, and focal examinees, $F$, conditioned on the $M$ levels of a variable that measures reference and focal examinees' overall ability, $X_m$. In this study, $Y$ is dichotomously scored. $X_m$ denotes an observed test score that excludes $Y$ and all items containing extensive DIF, or C-DIF (Dorans & Holland, 1993). In typical DIF assessments, the $M$ differences in (1) are summarized rather than individually evalua...