The Long-Term Sustainability of Different Item Response Theory Scaling Methods

Keller, Lisa A.; Keller, Robert R.

doi:10.1177/0013164410375111

Cited by 23 publications

(27 citation statements)

References 15 publications


“…SL produced slightly greater passing misclassifications than MS when there was a moderate or sizable amount of ability shift. To a great extent, our findings are not in opposition to the results of related studies (Pang et al, 2010; Keller and Keller, 2011; Keller and Hambleton, 2013). …”

Section: Discussion (contrasting)

confidence: 62%

“…As noted in previous equating studies (Keller and Keller, 2011; Keller and Hambleton, 2013; Kolen and Brennan, 2014), model fit is a strong assumption that IRT equating is based on. Only when the fit between the model and the empirical data of interest is satisfactory, can the IRT equating be appropriately applied.…”

Section: Introduction (mentioning)

confidence: 86%

“…Previous studies have shown that the SL method and the FCIP procedure performed similarly, and both outperformed the MS method in recovering ability changes (Pang et al, 2010; Keller and Keller, 2011; Keller and Hambleton, 2013). With dichotomous data, the characteristic curve methods performed better than the FCIP procedure when there was a mean shift in ability distribution (Keller and Keller, 2011); with mixed-format test data, however, FCIP performed best comparing to the characteristic curve methods (Keller and Hambleton, 2013).…”

Section: Introduction (mentioning)

confidence: 88%

“…With dichotomous data, the characteristic curve methods performed better than the FCIP procedure when there was a mean shift in ability distribution (Keller and Keller, 2011); with mixed-format test data, however, FCIP performed best comparing to the characteristic curve methods (Keller and Hambleton, 2013). …”

Section: Introduction (mentioning)

confidence: 99%


Practical Consequences of Item Response Theory Model Misfit in the Context of Test Equating with Mixed-Format Test Data

Zhao

Hambleton

2017

Front. Psychol.


In item response theory (IRT) models, assessing model-data fit is an essential step in IRT calibration. While no general agreement has ever been reached on the best methods or approaches to use for detecting misfit, perhaps the more important comment based upon the research findings is that rarely does the research evaluate IRT misfit by focusing on the practical consequences of misfit. The study investigated the practical consequences of IRT model misfit in examining the equating performance and the classification of examinees into performance categories in a simulation study that mimics a typical large-scale statewide assessment program with mixed-format test data. The simulation study was implemented by varying three factors, including choice of IRT model, amount of growth/change of examinees’ abilities between two adjacent administration years, and choice of IRT scaling methods. Findings indicated that the extent of significant consequences of model misfit varied over the choice of model and IRT scaling methods. In comparison with mean/sigma (MS) and Stocking and Lord characteristic curve (SL) methods, separate calibration with linking and fixed common item parameter (FCIP) procedure was more sensitive to model misfit and more robust against various amounts of ability shifts between two adjacent administrations regardless of model fit. SL was generally the least sensitive to model misfit in recovering equating conversion and MS was the least robust against ability shifts in recovering the equating conversion when a substantial degree of misfit was present. The key messages from the study are that practical ways are available to study model fit, and, model fit or misfit can have consequences that should be considered when choosing an IRT model. 
Not only does the study address the consequences of IRT model misfit, but also it is our hope to help researchers and practitioners find practical ways to study model fit and to investigate the validity of particular IRT models for achieving a specified purpose, to assure that the successful use of the IRT models are realized, and to improve the applications of IRT models with educational and psychological test data.


“…In particular, the Stocking and Lord Test Characteristic Curve method (SL; Stocking & Lord, 1983) has been shown to exhibit positive bias when there is a positive change in the ability distribution (e.g., Baldwin, Nering, & Baldwin, 2007), while the results from concurrent calibration have been mixed (e.g., Hanson & Béguin, 2002; Kim & Cohen, 1998). Fixed common item parameter (FCIP) scaling has been shown to produce minimal bias when two forms are scaled using this method, but does show some increasing bias as the number of scalings increases (Keller & Keller, 2011).…”

mentioning

confidence: 99%

The Effect of Changing Content on IRT Scaling Methods

Keller

2015

Applied Measurement in Education

Self Cite


Equating test forms is an essential activity in standardized testing, with increased importance with the accountability systems in existence through the mandate of Adequate Yearly Progress. It is through equating that scores from different test forms become comparable, which allows for the tracking of changes in the performance of students from one year to the next. This study compares three different item response theory scaling methods (fixed common item parameter, Stocking & Lord, and Concurrent Calibration) with respect to examinee classification into performance categories, and estimation of the ability parameter, when the content of the test form changes slightly from year to year, and the examinee ability distribution changes. The results indicate that calibration methods, especially concurrent calibration, produced more stable results than the transformation method.
Equating test forms is an essential activity in standardized testing, with increased importance with the accountability systems in existence through the mandate of adequate yearly progress (AYP). It is through equating that scores from different test forms become comparable, which allows for the tracking of changes in the performance of students from one year to the next. Given that there are different methods for equating, the choice of equating methods could lead to different results, and different inferences about the nature of the achievement of the students in question. As such, choosing the most appropriate equating method for the particular situation is essential. While there may not be an equating method that is best in all situations, studies that examine the context of interest can help inform the choice of methods.
In item response theory (IRT), there are primarily two methods of equating: observed score equating and true score equating.
Although there are only two popular methods of equating the tests, in the context of IRT, the heart of equating is actually a scaling step in which the parameter estimates from different calibrations are put onto a common scale, or metric. There are several popular scaling techniques that are usually implemented in IRT, and the research regarding these methods is not conclusive. The best method typically depends on the context of the equating. In particular, the degree of equivalence between groups that are used in the scaling and equating appears to be a factor that differentiates scaling methods. In the case of equivalent groups, very little difference in results have been observed while with non-equivalent groups, differences have
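The scaling step described in the excerpt above, placing parameter estimates from separate calibrations on a common metric, can be illustrated with the mean/sigma (MS) method named elsewhere in this report. The sketch below is a minimal illustration under stated assumptions, not the procedure used in any of the cited studies: it assumes difficulty estimates for the same common items are available from both calibrations, and the function names (`mean_sigma_constants`, `rescale_items`) are hypothetical. Note that MS is not one of the three methods compared in Keller (2015); it is used here only because it is the simplest linear transformation to show.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_new, b_base):
    """Return slope A and intercept B of the linear map theta_base = A * theta_new + B.

    b_new  : common-item difficulty estimates from the new calibration
    b_base : difficulty estimates for the same items on the base scale
    (Both argument names are illustrative assumptions.)
    """
    if len(b_new) != len(b_base) or len(b_new) < 2:
        raise ValueError("need at least two common items, in matching order")
    sd_new = pstdev(b_new)
    if sd_new == 0:
        raise ValueError("common-item difficulties on the new scale have no spread")
    # Mean/sigma: match the standard deviation, then the mean, of the difficulties.
    A = pstdev(b_base) / sd_new
    B = mean(b_base) - A * mean(b_new)
    return A, B


def rescale_items(a, b, A, B):
    """Put discrimination (a) and difficulty (b) estimates on the base scale."""
    a_star = [ai / A for ai in a]      # discriminations shrink by the slope
    b_star = [A * bi + B for bi in b]  # difficulties follow the linear map
    return a_star, b_star


# Toy common-item difficulties, invented for illustration only.
A, B = mean_sigma_constants([-1.0, 0.0, 1.0], [-1.5, 0.5, 2.5])
a_star, b_star = rescale_items([1.0, 2.0], [0.0, 1.0], A, B)
```

Ability estimates from the new calibration would be transformed with the same constants, as `A * theta + B`. Characteristic curve methods such as Stocking and Lord instead choose A and B by minimizing a difference between test characteristic curves, which requires numerical optimization and is not shown here.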


Long‐Term Impact of Valid Case Criterion on Capturing Population‐Level Growth Under Item Response Theory Equating

Deng

Monfils

2017

ETS Research Report Series


Using simulated data, this study examined the impact of different levels of stringency of the valid case inclusion criterion on item response theory (IRT)‐based true score equating over 5 years in the context of K–12 assessment when growth in student achievement is expected. Findings indicate that the use of the most stringent inclusion criterion generally yielded the most accurate results when overall root mean square error (RMSE) and bias were considered under both zero‐growth and growth conditions, for both one‐parameter logistic (1PL) and three‐parameter logistic (3PL) IRT models, and for both fixed common item parameter (FCIP) and test characteristic curve (TCC) scaling methods. The positive impact of applying the most stringent valid case inclusion criterion was more salient with the 3PL model, under which greater classification accuracy was observed.
