Performance assessments appear, on a priori grounds, likely to produce far more local item dependence (LID) than traditional multiple-choice tests do. This article (a) defines local item independence, (b) presents a compendium of causes of LID, (c) discusses some of LID's practical measurement implications, (d) details some empirical results for both performance assessments and multiple-choice tests, and (e) suggests some strategies for managing LID in order to avoid negative measurement consequences.
Unidimensional item response theory (IRT) has become widely used in the analysis and equating of educational achievement tests. If an IRT model is true, item responses must be locally independent when the trait is held constant. This paper presents several measures of local dependence that are used in conjunction with the three-parameter logistic model in the analysis
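One widely used measure of this kind is the Q₃ statistic: the correlation, computed across examinees, between the model residuals of a pair of items. The sketch below is a minimal illustration under assumed conditions (dichotomous items, 3PL item parameters and trait estimates already obtained from a calibration, and the conventional 1.7 logistic scaling constant), not the paper's implementation.

```python
# Minimal sketch of the Q3 local-dependence statistic for dichotomous items.
# Assumes 3PL item parameters (a, b, c) and trait estimates theta are already
# available; in practice these come from an IRT calibration program.
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def q3_matrix(responses, theta, a, b, c):
    """Q3 = correlation, over examinees, of residuals for each item pair.

    responses: (n_examinees, n_items) 0/1 matrix
    theta:     (n_examinees,) trait estimates
    a, b, c:   (n_items,) item parameters
    """
    expected = p_3pl(theta[:, None], a[None, :], b[None, :], c[None, :])
    residuals = responses - expected             # d_i = u_i - P_i(theta)
    return np.corrcoef(residuals, rowvar=False)  # off-diagonals are Q3 values
```

Under local independence, the off-diagonal entries cluster around a small negative value (roughly −1/(n − 1) for an n-item test), so item pairs with markedly positive Q₃ values are candidates for local dependence.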
A latent trait model goodness-of-fit statistic was defined, and its relationships to several other commonly used fit statistics were described. Simulation data were used to examine the behavior of these fit statistics under conditions similar to those found with real data. The simulation data were generated for 36 pseudo-items and 1,000 simulees using three-, two-, and one-parameter logistic latent trait models. The data were analyzed using three-, two-, and one-parameter models. Between-model comparisons were made of the fit statistics, trait estimates, and item parameter estimates. The three generating models produced clearly different patterns of results. The simulation results were compared to results for real data involving seventh- and eighth-grade students' performance on eight achievement tests. The achievement test results appeared most similar to the simulation results based on data generated with the three-parameter model. Some practical problems that can result from using an inappropriate model with multiple-choice tests are discussed.

Three latent trait models are in common use, and they differ in the number of item parameters they estimate to describe item characteristic functions. The three-parameter (3-PAR) model estimates item difficulties, discriminations, and lower asymptotes. The two-parameter (2-PAR) model estimates item difficulties and discriminations but assumes that all lower asymptotes are zero. The one-parameter (1-PAR), or Rasch, model estimates item difficulties but assumes that all item discriminations are constant and all lower asymptotes are zero. (See Allen & Yen, 1979; Lord & Novick, 1968; or Journal of Educational Measurement, Summer 1977, for a further discussion of these models.) When items have a multiple-choice format and it is possible for examinees of low ability to get an item correct through lucky guessing, the 2-PAR and 1-PAR models appear a priori to be inappropriate. However, in striving for simplicity, the researcher may be tempted to use the 2-PAR or 1-PAR model and hope that the inaccuracies that are introduced are unimportant. The researcher may turn to a statistical goodness-of-fit test to gauge the degree of these inaccuracies. This paper examines the behavior of a fit statistic, called Q₁, which is similar to fit statistics that are commonly used for latent trait models. In the following sections, (1) Q₁ is specified, and (2) informal theoretical justification is given for using a chi-square distribution as an approximation to the distribution of Q₁.
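To make these definitions concrete, the sketch below writes out the logistic item characteristic function (the 2-PAR and 1-PAR models being special cases of the 3-PAR form) together with a Q₁-style fit computation: examinees are grouped into cells by trait estimate, and observed proportions correct are compared with model predictions through a Pearson chi-square. The ten-cell grouping, the D = 1.7 scaling constant, and the exact chi-square form are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Sketch of the 3-PAR / 2-PAR / 1-PAR logistic item characteristic functions
# and a Q1-style goodness-of-fit statistic for a single item. Illustrative
# only: cell count, grouping rule, and D = 1.7 are assumptions.
import numpy as np

D = 1.7  # common logistic scaling constant

def icc(theta, b, a=1.0, c=0.0):
    """Logistic item characteristic curve.
    3-PAR: pass a, b, c.  2-PAR: c = 0.  1-PAR (Rasch): constant a, c = 0."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def q1_item(responses, theta, b, a=1.0, c=0.0, n_cells=10):
    """Q1-style fit: group examinees into n_cells by theta, then compare
    observed and predicted proportions correct with a Pearson chi-square."""
    order = np.argsort(theta)
    chi_sq = 0.0
    for cell in np.array_split(order, n_cells):
        n = cell.size
        obs = responses[cell].mean()               # observed p-correct
        exp = icc(theta[cell], b, a, c).mean()     # model-predicted p-correct
        exp = min(max(exp, 1e-6), 1.0 - 1e-6)      # guard against division by 0
        chi_sq += n * (obs - exp) ** 2 / (exp * (1.0 - exp))
    return chi_sq  # referred to a chi-square reference distribution
```

The resulting statistic is referred to a chi-square distribution with degrees of freedom roughly equal to the number of cells minus the number of item parameters estimated, which is the approximation the paper goes on to justify.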
Two methods of constructing equal‐interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.
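Point (b) can be reproduced in a small simulation. The sketch below (test length, item parameters, and sample size are arbitrary assumptions chosen for illustration) draws a normally distributed true trait, generates number-correct scores under a 3PL model, applies an inverse-normal transformation to the observed-score percentile ranks, and then checks whether the normalized scale is linear in the underlying trait.

```python
# Sketch of claim (b): normalizing observed scores need not yield a scale
# linearly related to a normally distributed true trait. All specifics
# (n_items, parameter ranges, sample size) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_examinees, n_items = 5000, 20
theta = rng.standard_normal(n_examinees)          # normal true trait
a = rng.uniform(0.5, 2.0, n_items)                # discriminations
b = rng.uniform(-2.0, 2.0, n_items)               # difficulties
c = np.full(n_items, 0.2)                         # lower asymptotes

p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b)))
raw = (rng.random((n_examinees, n_items)) < p).sum(axis=1)  # number correct

# Normalize: inverse-normal transform of (midpoint) percentile ranks.
pct = (stats.rankdata(raw) - 0.5) / n_examinees
normalized = stats.norm.ppf(pct)

# If normalization produced a scale linear in theta, mean normalized scores
# within equal-width theta bands would be evenly spaced; typically they are not.
for lo, hi in [(-2, -1), (-1, 0), (0, 1), (1, 2)]:
    band = (theta >= lo) & (theta < hi)
    print(f"theta in [{lo:+d},{hi:+d}): mean normalized score = "
          f"{normalized[band].mean():+.2f}")
```

Because guessing compresses the low end of the raw-score scale and the test ceiling compresses the high end, the band means are typically unevenly spaced, illustrating the claimed nonlinearity.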