A Comparison of Several Goodness-of-Fit Statistics

McKinley, Robert L.; Mills, Craig N.

doi:10.1177/014662168500900105

Cited by 114 publications

(109 citation statements)

References 7 publications


“…A number of studies have compared the 3PL model with 2PL and/or 1PL models in terms of model-data fit (Hambleton & Murray, 1983; McKinley & Mills, 1985; Swaminathan & Gifford, 1979; Yen, 1981). In general, results of these studies suggest that the 3PL model will provide better fit at the item level than the 2PL or 1PL models, unless the data are simulated to fit these latter models.…”

Section: Description of TOEFL Equating Design (mentioning)

confidence: 99%

“…In general, results of these studies suggest that the 3PL model will provide better fit at the item level than the 2PL or 1PL models, unless the data are simulated to fit these latter models. McKinley and Mills (1985) reported that when data were generated with the 3PL model, the 2PL model showed considerably more misfit than the 3PL model in terms of the proportions of items identified as misfitting by several goodness-of-fit statistics. However, under similar conditions, Yen (1981) found that the 2PL model fit the data almost as well as the 3PL model.…”

Section: Description of TOEFL Equating Design (mentioning)

confidence: 99%
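The nesting of the three models compared in these statements can be illustrated with a minimal sketch of the logistic item response functions. The function names, parameter names, and the scaling constant `D = 1.7` are assumptions for illustration and are not taken from the quoted papers.

```python
import math


def irf_3pl(theta, a, b, c=0.0, D=1.7):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing); D: scaling constant (assumed 1.7).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def irf_2pl(theta, a, b, D=1.7):
    """2PL model: the 3PL model with the lower asymptote fixed at 0."""
    return irf_3pl(theta, a, b, c=0.0, D=D)


def irf_1pl(theta, b, D=1.7):
    """1PL model: the 2PL model with a common discrimination (here 1)."""
    return irf_3pl(theta, a=1.0, b=b, c=0.0, D=D)
```

Because the 2PL and 1PL curves fall to zero for very low ability while a 3PL curve levels off at `c`, data generated with `c > 0` cannot be reproduced exactly by the simpler models, which is one plausible reason for the misfit the statements describe.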


An Investigation of the Use of Simplified IRT Models for Scaling and Equating the TOEFL Test

Way

Reese

1990

ETS Research Report Series


A continuing program of research related to the TOEFL test is carried out under the direction of the TOEFL Research Committee. Its six members include representatives of the Policy Council, the TOEFL Committee of Examiners, and distinguished English as a second language specialists from the academic community. Currently the Committee meets twice yearly to review and approve proposals for test-related research and to set guidelines for the entire scope of the TOEFL research program. Members of the Research Committee serve three-year terms at the invitation of the Policy Council; the chair of the committee serves on the Policy Council.

Because the studies are specific to the test and the testing program, most of the actual research is conducted by ETS staff rather than by outside researchers. However, many projects require the cooperation of other institutions, particularly those with programs in the teaching of English as a foreign or second language. Representatives of such programs who are interested in participating in or conducting TOEFL-related research are invited to contact the TOEFL program office. Local research may sometimes require access to TOEFL data. In such cases, the program may provide the data following approval by the Research Committee. All TOEFL research projects must undergo appropriate ETS review to ascertain that the confidentiality of data will be protected.

Current (1990-91)

Abstract

The purpose of this study was to explore the use of two alternative item response theory estimation models in the scaling and equating of TOEFL, a modified one-parameter model (M1PL) and a modified two-parameter model (M2PL), and to compare item scaling and test equating results based on these two alternative models with results based on the three-parameter model (3PL) that is currently being used to scale and equate the TOEFL. The study employed a design in which a typical TOEFL equating was simulated using artificial data. The simulated equatings were compared in terms of correlations between estimated and generating parameters, model-data fit, and concordance of simulated score conversions with conversions based on the generating parameters.

The results of the study clearly indicated that the 3PL model performed better than the M1PL and M2PL models on the basis of each of the evaluation criteria. There was also evidence that the M2PL model performed better than the M1PL model, particularly in terms of model-data fit and in the weighted root mean square difference statistics used to evaluate the simulated score conversions. The results of the study also indicated that discrepancies between score conversions based on the M1PL and M2PL models and those based on the 3PL model tended to occur at the lower and upper ends of the score scales. Finally, the results of the study for the 3PL model indicated that while correlations between item parameter estimates and generating parameters tended to be affected by sample size, neither the quality of model-data fit nor the quality of simulated equatings appeared to be ...


“…This index, developed by McKinley and Mills (1985), is called the likelihood-ratio χ². The G² index is similar to Yen's Q1 index.…”

Section: G² Index (unclassified)
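A likelihood-ratio item-fit statistic of this general kind can be sketched as follows. This is a minimal illustration, assuming examinees have already been sorted into ability groups and that each group has an observed proportion correct and a model-expected proportion correct. The grouping rule, the number of groups, and all names below are assumptions and are not taken from the quoted text.

```python
import math


def g2_item_fit(group_sizes, observed_props, expected_props):
    """Likelihood-ratio chi-square (G^2) for one dichotomous item.

    group_sizes:    number of examinees in each ability group
    observed_props: observed proportion correct in each group
    expected_props: model-predicted proportion correct in each group,
                    each strictly between 0 and 1
    """
    total = 0.0
    for n, obs, model in zip(group_sizes, observed_props, expected_props):
        # Sum over the two response categories: correct and incorrect.
        for o, e in ((obs, model), (1.0 - obs, 1.0 - model)):
            if o > 0.0:  # a cell with observed proportion 0 contributes 0
                total += n * o * math.log(o / e)
    return 2.0 * total
```

The statistic is zero when the observed and expected proportions agree in every group and grows as they diverge. For example, two groups of 100 examinees with observed proportions 0.4 and 0.6 against expected proportions 0.3 and 0.7 give a value of about 9.03.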

Madde Tepki Kuramı’na Dayalı Madde-Uyum İndekslerinin I.Tip Hata ve Güç Oranlarının İncelenmesi

Sünbül

Aşiret

2017

Eğitimde Ve Psikolojide Ölçme Ve Değerlendirme Dergisi


Abstract

This study investigated the Type I error and power rates of item-fit indices under various conditions (sample size, test length, and percentage of misfit) for dichotomously scored items generated according to the one-, two-, and three-parameter logistic models of Item Response Theory. The Type I error and power rates of the indices were determined in a simulation study. The χ², Q1, and G² indices were used as traditional item-fit indices and the S-χ² index as an alternative index. The four item-fit indices were compared by manipulating sample size (1000, 2000, 4000), test length (20, 30, 40), and percentage of misfit (0%, 10%, 30%, and 50%). Item responses were generated in R 3.3.2 and analyzed with the "mirt" package. Two kinds of model were used: the generating model and the analysis model. The p values and degrees of freedom of the item-fit indices were calculated for item responses that fit the generating model and for item responses that fit the analysis model. Type I error and power rates were evaluated at the 0.05 significance level. The indices were compared by calculating the Type I error and power rates of each index under all conditions. The findings indicated that the S-χ² index had lower error than the other indices across all factors. With sample sizes of 2000 or more and tests of 20 or more items, the S-χ² index had a lower Type I error rate and higher power than the other indices.

Keywords: Item Response Theory, item-fit index, Type I error, power, S-χ²
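The Type I error and power rates described in this abstract are both rejection rates at the 0.05 level, computed over different sets of items. A minimal sketch, assuming the p values of an item-fit index have already been collected, is given below. The study itself used R and the "mirt" package; the Python function and the p values here are hypothetical illustrations, not the authors' code or data.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of item-fit tests rejected at significance level alpha."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p values for items generated to fit the analysis model:
# the rejection rate here estimates the Type I error rate.
fitting_items_p = [0.62, 0.03, 0.48, 0.91, 0.27, 0.11, 0.75, 0.33, 0.56, 0.19]

# Hypothetical p values for items generated to misfit the analysis model:
# the rejection rate here estimates power.
misfitting_items_p = [0.001, 0.04, 0.20, 0.002, 0.01]

type_i_error = rejection_rate(fitting_items_p)
power = rejection_rate(misfitting_items_p)
```

A well-behaved index keeps the first rate close to the nominal 0.05 while pushing the second toward 1.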


“…Also, as the number of categories of each item was four (= {0, 1, 2, 3}), there were four parameters for both models, i.e., one slope and three location parameters. Two different numbers of individuals at each generation were prepared (G = 16, 32), and two different sample sizes were arranged (N = 1000, 2000). In addition, the likelihood-ratio chi-square (χ²) statistic (McKinley & Mills, 1985) with degrees of freedom (Q − 1) × Σ_j (C_j − 1) − 4n, where Q is the number of quadrature points on the latent scale, and four is the number of parameters, was adopted for the fitness evaluation function. There were no substantial differences between the comparison with the information criteria and that with the χ² statistic as long as the number of GRM parameters was equal to that for GPCM.…”

Section: Setup for Simulation (mentioning)

confidence: 99%

Selection of Item Response Model by Genetic Algorithm

Shojima

2007

Behaviormetrika


A method of selecting an item response model with a genetic algorithm is proposed, where a model indicator variable is regarded as a chromosome to distinguish other individuals. This scheme enables a model for each item to be selected automatically. The genetic algorithm with the set of techniques that is implemented here is called the simple genetic algorithm, and the results obtained from simulation studies were satisfactory. Simulation studies and numerical examples were also used to examine which of the two prevailing models, the graded response model or the generalized partial credit model, was the more useful. The results obtained from simulation studies proved the graded response model fit the data more flexibly, since it fit the data generated under the generalized partial credit model more frequently than for the opposite case. However, the generalized partial credit model was more suitable for two real data sets.
