Performance assessments appear, on a priori grounds, likely to produce far more local item dependence (LID) than traditional multiple-choice tests do. This article (a) defines local item independence, (b) presents a compendium of causes of LID, (c) discusses some of LID's practical measurement implications, (d) details some empirical results for both performance assessments and multiple-choice tests, and (e) suggests some strategies for managing LID in order to avoid negative measurement consequences.
Unidimensional item response theory (IRT) has become widely used in the analysis and equating of educational achievement tests. If an IRT model is true, item responses must be locally independent when the trait is held constant. This paper presents several measures of local dependence that are used in conjunction with the three-parameter logistic model in the analysis
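One widely used measure of this kind is the Q₃ statistic: the correlation, computed across examinees, between the model residuals of a pair of items. The sketch below is a minimal illustration under assumed conditions (dichotomous items, 3PL item parameters and trait estimates already obtained from a calibration, and the conventional 1.7 logistic scaling constant), not the paper's implementation.

```python
# Minimal sketch of the Q3 local-dependence statistic for dichotomous items.
# Assumes 3PL item parameters (a, b, c) and trait estimates theta are already
# available; in practice these come from an IRT calibration program.
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def q3_matrix(responses, theta, a, b, c):
    """Q3 = correlation, over examinees, of residuals for each item pair.

    responses: (n_examinees, n_items) 0/1 matrix
    theta:     (n_examinees,) trait estimates
    a, b, c:   (n_items,) item parameters
    """
    expected = p_3pl(theta[:, None], a[None, :], b[None, :], c[None, :])
    residuals = responses - expected             # d_i = u_i - P_i(theta)
    return np.corrcoef(residuals, rowvar=False)  # off-diagonals are Q3 values
```

Under local independence, the off-diagonal entries cluster around a small negative value (roughly −1/(n − 1) for an n-item test), so item pairs with markedly positive Q₃ values are candidates for local dependence.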
A latent trait model goodness-of-fit statistic was defined, and its relationships to several other commonly used fit statistics were described. Simulation data were used to examine the behavior of these fit statistics under conditions similar to those found with real data. The simulation data were generated for 36 pseudo-items and 1,000 simulees using three-, two-, and one-parameter logistic latent trait models. The data were analyzed using three-, two-, and one-parameter models. Between-model comparisons were made of the fit statistics, trait estimates, and item parameter estimates. The three generating models produced clearly different patterns of results. The simulation results were compared to results for real data involving seventh- and eighth-grade students' performance on eight achievement tests. The achievement test results appeared most similar to the simulation results based on data generated with the three-parameter model. Some practical problems that can result from using an inappropriate model with multiple-choice tests are discussed.

Three latent trait models are in common use, and they differ in the number of item parameters they estimate to describe item characteristic functions. The three-parameter (3-PAR) model estimates item difficulties, discriminations, and lower asymptotes. The two-parameter (2-PAR) model estimates item difficulties and discriminations but assumes that all lower asymptotes are zero. The one-parameter (1-PAR), or Rasch, model estimates item difficulties but assumes that all item discriminations are constant and all lower asymptotes are zero. (See Allen & Yen, 1979; Lord & Novick, 1968; or Journal of Educational Measurement, Summer 1977, for a further discussion of these models.) When items have a multiple-choice format and it is possible for examinees of low ability to get an item correct through lucky guessing, the 2-PAR and 1-PAR models appear a priori to be inappropriate. However, in striving for simplicity, the researcher may be tempted to use the 2-PAR or 1-PAR model and hope that the inaccuracies that are introduced are unimportant. The researcher may turn to a statistical goodness-of-fit test to gauge the degree of these inaccuracies. This paper examines the behavior of a fit statistic, called Q₁, which is similar to fit statistics that are commonly used for latent trait models. In the following sections, (1) Q₁ is specified, and (2) informal theoretical justification is given for using a chi-square distribution as an approximation to the distribution of Q₁.
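To make these definitions concrete, the sketch below writes out the logistic item characteristic function (the 2-PAR and 1-PAR models being special cases of the 3-PAR form) together with a Q₁-style fit computation: examinees are grouped into cells by trait estimate, and observed proportions correct are compared with model predictions through a Pearson chi-square. The ten-cell grouping, the D = 1.7 scaling constant, and the exact chi-square form are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Sketch of the 3-PAR / 2-PAR / 1-PAR logistic item characteristic functions
# and a Q1-style goodness-of-fit statistic for a single item. Illustrative
# only: cell count, grouping rule, and D = 1.7 are assumptions.
import numpy as np

D = 1.7  # common logistic scaling constant

def icc(theta, b, a=1.0, c=0.0):
    """Logistic item characteristic curve.
    3-PAR: pass a, b, c.  2-PAR: c = 0.  1-PAR (Rasch): constant a, c = 0."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def q1_item(responses, theta, b, a=1.0, c=0.0, n_cells=10):
    """Q1-style fit: group examinees into n_cells by theta, then compare
    observed and predicted proportions correct with a Pearson chi-square."""
    order = np.argsort(theta)
    chi_sq = 0.0
    for cell in np.array_split(order, n_cells):
        n = cell.size
        obs = responses[cell].mean()               # observed p-correct
        exp = icc(theta[cell], b, a, c).mean()     # model-predicted p-correct
        exp = min(max(exp, 1e-6), 1.0 - 1e-6)      # guard against division by 0
        chi_sq += n * (obs - exp) ** 2 / (exp * (1.0 - exp))
    return chi_sq  # referred to a chi-square reference distribution
```

The resulting statistic is referred to a chi-square distribution with degrees of freedom roughly equal to the number of cells minus the number of item parameters estimated, which is the approximation the paper goes on to justify.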
Two methods of constructing equal‐interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.
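Point (b) can be reproduced in a small simulation. The sketch below (test length, item parameters, and sample size are arbitrary assumptions chosen for illustration) draws a normally distributed true trait, generates number-correct scores under a 3PL model, applies an inverse-normal transformation to the observed-score percentile ranks, and then checks whether the normalized scale is linear in the underlying trait.

```python
# Sketch of claim (b): normalizing observed scores need not yield a scale
# linearly related to a normally distributed true trait. All specifics
# (n_items, parameter ranges, sample size) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_examinees, n_items = 5000, 20
theta = rng.standard_normal(n_examinees)          # normal true trait
a = rng.uniform(0.5, 2.0, n_items)                # discriminations
b = rng.uniform(-2.0, 2.0, n_items)               # difficulties
c = np.full(n_items, 0.2)                         # lower asymptotes

p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b)))
raw = (rng.random((n_examinees, n_items)) < p).sum(axis=1)  # number correct

# Normalize: inverse-normal transform of (midpoint) percentile ranks.
pct = (stats.rankdata(raw) - 0.5) / n_examinees
normalized = stats.norm.ppf(pct)

# If normalization produced a scale linear in theta, mean normalized scores
# within equal-width theta bands would be evenly spaced; typically they are not.
for lo, hi in [(-2, -1), (-1, 0), (0, 1), (1, 2)]:
    band = (theta >= lo) & (theta < hi)
    print(f"theta in [{lo:+d},{hi:+d}): mean normalized score = "
          f"{normalized[band].mean():+.2f}")
```

Because guessing compresses the low end of the raw-score scale and the test ceiling compresses the high end, the band means are typically unevenly spaced, illustrating the claimed nonlinearity.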