At the Educational Testing Service, the Mantel‐Haenszel procedure is used for differential item functioning (DIF) detection and the standardization procedure is used to describe DIF. This report describes these procedures. First, an important distinction is made between DIF and impact, pointing to the need to compare the comparable. Then these two contingency‐table DIF procedures are described in some detail, first in terms of their own origins as DIF procedures and then from a common framework that points out their similarities and differences. The relationship between the Mantel‐Haenszel procedure and IRT models in general, and the Rasch model in particular, is discussed. The utility of the standardization approach for assessing differential distractor functioning is described. Several issues in applied DIF analyses are discussed, including inclusion of the studied item in the matching variable and refinement of the matching variable. Future research topics dealing with the matching variable, the studied variable, and the group variable are also discussed.
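To make the two contingency‐table indices concrete, here is a minimal Python sketch. The Mantel‐Haenszel D‐DIF statistic rescales the common odds ratio onto the ETS delta metric via −2.35 ln(α), and the standardization index is a weighted difference in proportions correct, typically weighted by the focal group's score distribution. The function and variable names and the array layout are assumptions for illustration, not code from the report.

```python
import numpy as np

def mantel_haenszel_d_dif(a, b, c, d):
    """Mantel-Haenszel D-DIF for one studied item.

    a, b, c, d are arrays indexed by matching-score level k:
      a[k]: reference-group examinees answering correctly
      b[k]: reference-group examinees answering incorrectly
      c[k]: focal-group examinees answering correctly
      d[k]: focal-group examinees answering incorrectly
    """
    t = a + b + c + d                                  # total examinees at level k
    alpha_mh = np.sum(a * d / t) / np.sum(b * c / t)   # MH common odds ratio
    return -2.35 * np.log(alpha_mh)                    # delta-scale D-DIF

def std_p_dif(p_focal, p_ref, n_focal):
    """Standardization index: weighted difference in proportions correct
    at each matching-score level, weighted here (as is typical) by the
    focal group's counts."""
    w = n_focal / n_focal.sum()
    return np.sum(w * (p_focal - p_ref))
```

Under the usual reading, D‐DIF values near zero indicate negligible DIF, with negative values indicating that the item disadvantages the focal group after matching.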
How does the fact that two tests should not be equated manifest itself? This paper addresses this question by studying the degree to which equating functions fail to exhibit population invariance across subpopulations. Equating functions are supposed to be population invariant by definition. But when two tests are not equatable, it is possible that the linking functions used to connect the scores of one to the scores of the other are not invariant across different populations of examinees. While no acceptable equating function is ever completely population invariant, in the situations where equating is usually performed we believe that the dependence of the equating function on the population used to compute it is usually small enough to be ignored. We introduce two root‐mean‐square difference measures of the degree to which the functions used to link two tests, computed on different subpopulations, differ from the linking function computed for the whole population. We also introduce the system of “parallel‐linear” linking functions for multiple subpopulations and show that, for this system, our measure of population invariance can be computed easily from the standardized mean differences between the scores of the subpopulations on the two tests. For the parallel‐linear case, we develop a correlation‐based upper bound on our measure that holds for all systems of subpopulations. We illustrate these ideas using data from the SAT I and from a concordance study of several combinations of ACT and SAT I scores. In the appendices, we give some theoretical results bearing on the other equating “requirements” of “same construct” and “same reliability,” and on one aspect of Lord's concept of equity.
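The abstract does not give the formulas, but a root‐mean‐square difference measure of this kind can be sketched as follows: average the squared gap between each subpopulation's linking function and the whole‐population linking function over the score distribution, weight across subpopulations, take the square root, and standardize. A minimal sketch; the specific weighting choices and the standardization by the target‐score standard deviation are assumptions for illustration, not the published definitions.

```python
import numpy as np

def remsd(link_sub, link_pop, weights, x_scores, x_density, sigma_y):
    """Root-mean-square difference between subpopulation linking
    functions and the whole-population linking function.

    link_sub:  list of callables e_j(x), one per subpopulation j
    link_pop:  callable e_P(x) for the whole population
    weights:   subpopulation weights w_j (summing to 1)
    x_scores:  grid of X scores; x_density: P(X = x) on that grid
    sigma_y:   SD of Y in the whole population (standardizes the result)
    """
    e_pop = np.array([link_pop(x) for x in x_scores])
    sq = 0.0
    for w, e_j in zip(weights, link_sub):
        diff2 = (np.array([e_j(x) for x in x_scores]) - e_pop) ** 2
        sq += w * np.sum(x_density * diff2)   # expected squared gap for group j
    return np.sqrt(sq) / sigma_y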
The standardization method for assessing unexpected differential item performance or differential item functioning (DIF) is introduced. The principal findings of the first five studies that have used this approach on the Scholastic Aptitude Test are presented.
Equating functions are supposed to be population invariant by definition. But when two tests are not equatable, it is possible that the linking functions used to connect the scores of one to the scores of the other are not invariant across different populations of examinees. We introduce two root-mean-square difference measures of the degree to which linking functions differ across subpopulations. We also introduce the system of "parallel-linear" linking functions for multiple subpopulations and show that, for this system, our measure of population invariance can be easily computed from the standardized mean differences between the scores of the subpopulations on the two tests. For the parallel-linear case, we develop a correlation-based upper bound on our measure.
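In the parallel‐linear system the subpopulation linking functions share a common slope, so each group's gap from the whole‐population function is a constant driven by that group's standardized mean differences on the two tests. A hedged sketch, assuming the measure reduces to a weighted root mean square of the gaps between those standardized mean differences, with a correlation‐based bound of the form sqrt(2(1 − ρ)); the exact normalization in the published measure may differ.

```python
import numpy as np

def remsd_parallel_linear(w, mu_x, mu_y, mu_x_pop, mu_y_pop, sd_x, sd_y):
    """Parallel-linear case: weighted RMS of the gaps between each
    subpopulation's standardized mean differences on tests X and Y."""
    d_x = (np.asarray(mu_x) - mu_x_pop) / sd_x   # standardized mean diffs on X
    d_y = (np.asarray(mu_y) - mu_y_pop) / sd_y   # standardized mean diffs on Y
    return np.sqrt(np.sum(np.asarray(w) * (d_x - d_y) ** 2))

def correlation_bound(rho_xy):
    """Upper bound of the assumed form sqrt(2 * (1 - rho)): the closer the
    two tests' whole-population correlation is to 1, the less room the
    linking functions have to diverge across subpopulations."""
    return np.sqrt(2.0 * (1.0 - rho_xy))
```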
To the extent that the outcomes of health assessment instruments are to be used interchangeably, the summary scores based on those outcomes need to be equated or otherwise made comparable. If the summary scores of different health assessment instruments are not equated, inferences based on them could be flawed. Ideally, summary scores would be comparable because of careful instrument design; in practice, that rarely happens, and statistical intervention is usually needed. This article addresses key questions associated with the linking of summary scores of health outcomes. What is meant by outcome linking and equating? How does equating differ from other types of linking? What common data collection designs are used to capture data for outcomes linking? What are some of the standard statistical procedures used to link outcomes directly, and what assumptions do they make? What role does item response theory (IRT) play in linking outcomes, and what assumptions do IRT methods make? The article makes a distinction between direct statistical adjustments of summary score distributions and indirect procedures based on psychometric models of items or questions.
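One standard direct adjustment of summary score distributions is equipercentile linking: map each score on one instrument to the score with the same percentile rank on the other. A minimal Python sketch, assuming simple empirical percentile ranks (real applications add presmoothing and continuization); the function and variable names are illustrative, not from the article.

```python
import numpy as np

def equipercentile_link(scores_x, scores_y):
    """Return a function mapping an X summary score to the Y summary
    score at the same empirical percentile rank."""
    x_sorted = np.sort(scores_x)
    n = len(x_sorted)

    def link(x):
        pr = np.searchsorted(x_sorted, x, side="right") / n  # percentile rank of x
        return np.quantile(scores_y, min(max(pr, 0.0), 1.0)) # matching Y quantile

    return link
```

Usage: fit `link = equipercentile_link(x_sample, y_sample)` on examinees who took both instruments (a single-group design), then apply `link(x)` to translate new X scores onto the Y scale.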