The Effect of Changing Content on IRT Scaling Methods (2015)
DOI: 10.1080/08957347.2014.1002922

Abstract: Equating test forms is an essential activity in standardized testing, and one that has taken on added importance under the accountability systems mandated by Adequate Yearly Progress. It is through equating that scores from different test forms become comparable, which allows changes in student performance to be tracked from one year to the next. This study compares three item response theory scaling methods (fixed common item parameter, Stocking & Lord, and concurrent calibration…
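
Of the three scaling methods named in the abstract, Stocking & Lord linking is the most self-contained algorithmically: it searches for the linear transformation constants (A, B) that make the anchor items' test characteristic curve, after rescaling the new form's parameters, match the curve from the base calibration. The following is a minimal sketch for the 2PL model; the item parameters, quadrature grid, and the "true" constants A = 1.1, B = 0.25 are all invented for illustration and are not taken from the study.

```python
# Minimal Stocking & Lord linking sketch for the 2PL model (illustrative only).
import numpy as np
from scipy.optimize import minimize

def p_2pl(theta, a, b):
    """2PL response probabilities, shape (len(theta), len(b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

def stocking_lord_loss(AB, a_new, b_new, a_old, b_old, theta):
    """Squared distance between anchor-item test characteristic curves:
    the base-scale curve vs. the new form's curve after rescaling by (A, B)."""
    A, B = AB
    tcc_old = p_2pl(theta, a_old, b_old).sum(axis=1)
    # Parameter transformation onto the base scale: a* = a / A, b* = A*b + B.
    tcc_new = p_2pl(theta, a_new / A, A * b_new + B).sum(axis=1)
    return np.sum((tcc_old - tcc_new) ** 2)

# Hypothetical anchor parameters from two separate calibrations, where the
# new calibration's scale differs from the base scale by A = 1.1, B = 0.25.
rng = np.random.default_rng(0)
a_old = rng.uniform(0.8, 1.6, 10)
b_old = rng.normal(0.0, 1.0, 10)
a_new = 1.1 * a_old
b_new = (b_old - 0.25) / 1.1

theta = np.linspace(-4, 4, 41)  # quadrature points on the base scale
res = minimize(stocking_lord_loss, x0=[1.0, 0.0],
               args=(a_new, b_new, a_old, b_old, theta))
print("Recovered (A, B):", res.x)  # close to (1.1, 0.25) by construction
```

Matching the whole test characteristic curve, rather than item parameters one at a time, is what distinguishes Stocking & Lord linking from simpler mean/mean or mean/sigma methods.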

Cited by 2 publications (2 citation statements)
References 16 publications (20 reference statements)

“…Bias depends on the sample size (Hanson & Béguin, 2002; Kang & Petersen, 2012), the number of items with parameters available from previous calibrations (e.g., Arai & Mayekawa, 2011; Kim, Cole, & Mwavita, 2018), the amount of cross‐national DIF (Sachse, Roppelt, & Haag, 2016), and shifts in the latent ability distributions across assessments (e.g., Baldwin, Baldwin, & Nering, 2007; Keller, Keller, & Baldwin, 2007). Keller and Keller (2011, 2015), however, showed that FIPC works best for complex changes in the latent ability distributions and in cases where the content of the assessment changes. Zhao and Hambleton (2017) showed that FIPC was robust against ability shifts across two adjacent assessments.…”
Section: Purpose of the Study and Research Questions
confidence: 99%
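
Since FIPC (fixed common item parameter calibration) is the method this statement singles out, a compact sketch of the idea may help: the anchor items' parameters are frozen at their values from the earlier calibration while the new items are estimated, which keeps the new form on the old scale even when the ability distribution has shifted. The Rasch-model version below uses simulated data and a fixed standard-normal quadrature prior for brevity; operational FIPC variants also update the latent-distribution weights across EM cycles, which matters precisely in the shifted-ability scenarios discussed above. All names and values are illustrative.

```python
# Minimal FIPC sketch for the Rasch model (illustrative, simulated data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_persons = 500
b_anchor = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # fixed at old calibration values
b_new_true = rng.normal(0.3, 1.0, 5)              # new items, to be estimated
theta = rng.normal(0.3, 1.0, n_persons)           # ability shifted vs. old form

# Simulate responses to the 10-item new form (5 anchors + 5 new items).
b_all = np.concatenate([b_anchor, b_new_true])
probs = 1.0 / (1.0 + np.exp(-(theta[:, None] - b_all)))
X = (rng.random(probs.shape) < probs).astype(float)

nodes = np.linspace(-4, 4, 61)   # quadrature grid on the OLD scale
weights = np.exp(-0.5 * nodes**2)
weights /= weights.sum()         # fixed N(0, 1) prior (a simplification)

def neg_marginal_loglik(b_free):
    """Marginal log-likelihood with anchor difficulties held fixed, so the
    newly estimated items inherit the old calibration's metric."""
    b = np.concatenate([b_anchor, b_free])
    p = 1.0 / (1.0 + np.exp(-(nodes[:, None] - b)))          # (quad, items)
    ll = X @ np.log(p).T + (1.0 - X) @ np.log(1.0 - p).T     # (persons, quad)
    return -np.sum(np.log(np.exp(ll) @ weights))

res = minimize(neg_marginal_loglik, x0=np.zeros(5), method="BFGS")
print("Estimated new-item difficulties:", np.round(res.x, 2))
```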
“…If equating studies vary the δ or sometimes the θ distribution, then mostly to challenge the equity requirement, in the form of a shift of the mean, i.e., simulating change or growth, as in Kopp and Jones (2020), Han et al. (2012), He et al. (2013), or Waterbury and DeMars (2021). The skewness of the distribution of the θ parameter is also sometimes varied, but by preserving good targeting, that is, with overlapping dispersion of the δ and θ, as in Manna and Gu (2019) or Keller and Keller (2015). In any case, Suanthong et al. (2000) mention, citing L.…”
Section: Introduction
confidence: 99%