For educational tests, it is critical to maintain consistency of score scales and to understand the sources of variation in score means over time. This practice helps ensure that interpretations of test takers' abilities are comparable from one administration (or one form) to another. This study examines the consistency of reported scores for the TOEIC® Speaking and Writing tests using two statistical procedures. Specifically, the stability of TOEIC Speaking score means from 431 forms administered over a 3-year period was evaluated using harmonic regression, and the stability of TOEIC Writing score means from 66 forms administered over a 3-year period was evaluated using analysis of variance. Results indicated that fluctuations in the TOEIC Speaking and Writing score means mainly reflect changes in test takers' overall English speaking or writing ability rather than score inaccuracies. For both tests, a large proportion of the variation in score means can be explained by seasonality (the rise or fall of score means associated with specific times of the year) and by test takers' demographic characteristics, both of which have been shown to be related to test-taker ability. These findings provide evidence for the consistency of the TOEIC Speaking and Writing score scales across forms.
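As a rough illustration of the harmonic regression idea, the sketch below fits sine and cosine terms with an annual period to simulated form-level score means; the data, the single annual harmonic, and all variable names are assumptions made for illustration, not the study's actual specification.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical form-level data: each form has an administration date
# (day of year) and a mean score. Values are illustrative only.
rng = np.random.default_rng(0)
day_of_year = rng.integers(1, 366, size=431)
score_means = (
    130
    + 3.0 * np.sin(2 * np.pi * day_of_year / 365.25)  # seasonal rise/fall
    + rng.normal(0, 1.5, size=431)                     # form-to-form noise
)

# Harmonic regression: regress score means on sine/cosine terms so the
# fitted curve captures periodic (seasonal) variation in the means.
X = np.column_stack([
    np.sin(2 * np.pi * day_of_year / 365.25),
    np.cos(2 * np.pi * day_of_year / 365.25),
])
fit = sm.OLS(score_means, sm.add_constant(X)).fit()
print(fit.summary())  # R-squared = share of variation explained by seasonality
```

In such a model, a high R-squared indicates that the apparent instability in form means is largely predictable seasonal variation in the examinee population rather than drift in the score scale.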
For a testing program with frequent administrations, it is important to understand and monitor the stability and fluctuation of test performance across administrations. Different methods have been proposed for this purpose. This study explored the potential of multilevel analysis for understanding and monitoring examinees' test performance across administrations based on their background information. Using the test scores and background information of 330,091 examinees collected from 254 administrations of an English speaking test, the study found that (a) at the individual examinee level, examinees' background characteristics had statistically significant relationships with their test performance, and these relationships varied across administrations; however, the prediction of individual test scores from background variables was not strong; and (b) at the administration level, group composition had strong relationships with administration means, and the prediction of administration means from group composition variables was fairly strong. The results suggest that multilevel analysis can be applied to understanding and monitoring test performance across administrations by identifying statistical relationships between examinees' characteristics and their test performance at both the individual and administration levels.
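The sketch below illustrates the kind of two-level analysis described, using simulated data and the statsmodels mixed-model API; the predictor (study_years), the group sizes, and the effect sizes are all hypothetical stand-ins for the study's actual background variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: examinees nested within administrations.
rng = np.random.default_rng(1)
n_admin, n_per = 50, 100
admin_effect = rng.normal(0, 5, n_admin)
rows = []
for a in range(n_admin):
    study_years = rng.normal(4, 1.5, n_per)
    score = 100 + admin_effect[a] + 2.0 * study_years + rng.normal(0, 10, n_per)
    rows.append(pd.DataFrame({"admin_id": a, "study_years": study_years,
                              "score": score}))
df = pd.concat(rows, ignore_index=True)

# Level 1: examinee-level predictor with a random intercept and random
# slope across administrations (the slope captures relationships that
# vary from administration to administration).
ml_fit = smf.mixedlm("score ~ study_years", df, groups=df["admin_id"],
                     re_formula="~study_years").fit()
print(ml_fit.summary())

# Level 2: regress administration means on group composition
# (here, the administration-level mean of the background variable).
admin = df.groupby("admin_id").mean()
print(smf.ols("score ~ study_years", admin).fit().rsquared)
```

The contrast the study reports would show up here as a modest individual-level R-squared but a much stronger fit for the administration-level regression.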
The nature of the matching criterion (usually the total score) in studies of differential item functioning (DIF) has been shown to affect the accuracy of DIF detection procedures. One question related to the matching criterion is whether the studied item should be included in it. Although many studies suggest that the studied item should always be included in the criterion, the validity of this recommendation for models other than the Rasch model has not been examined. This study evaluates the effect of including or excluding the studied item in the matching criterion in situations that mimic real testing conditions, where the assumptions of the Rasch model are violated. A simulation study examined the effect of including or excluding the studied item across different magnitudes of DIF and different group ability distributions, for data generated under two-parameter logistic (2PL) item response theory (IRT) and multidimensional item response theory (MIRT) models. Results show that including the studied item leads to less biased DIF estimates and more appropriate Type I error rates, especially when the group ability distributions differ. When the studied item was excluded from the matching criterion, a systematic bias in DIF estimation favoring the high-ability group was found consistently across all simulated conditions.
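To make the design concrete, the following sketch simulates 2PL response data for two groups that differ in ability and computes a Mantel-Haenszel DIF statistic with the studied item included versus excluded from the matching total. The item parameters and sample sizes are illustrative assumptions, and Mantel-Haenszel stands in here for whichever DIF procedures the study actually compared.

```python
import numpy as np

rng = np.random.default_rng(2)

def sim_2pl(theta, a, b):
    """Simulate dichotomous responses under the 2PL IRT model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

n_items, n_per_group = 30, 2000
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0, 1, n_items)

# Groups differ in ability (impact), the condition under which
# inclusion/exclusion of the studied item matters most.
resp_ref = sim_2pl(rng.normal(0.5, 1, n_per_group), a, b)   # reference group
resp_foc = sim_2pl(rng.normal(-0.5, 1, n_per_group), a, b)  # focal group

def mh_log_odds(resp_r, resp_f, item, include_studied):
    """Log Mantel-Haenszel odds ratio, matching on the total score with or
    without the studied item included in the matching criterion."""
    cols = np.arange(resp_r.shape[1])
    keep = cols if include_studied else cols[cols != item]
    tot_r, tot_f = resp_r[:, keep].sum(1), resp_f[:, keep].sum(1)
    num = den = 0.0
    for s in np.union1d(tot_r, tot_f):
        r, f = resp_r[tot_r == s, item], resp_f[tot_f == s, item]
        n = len(r) + len(f)
        if len(r) == 0 or len(f) == 0:
            continue
        num += r.sum() * (len(f) - f.sum()) / n  # ref right * focal wrong
        den += f.sum() * (len(r) - r.sum()) / n  # focal right * ref wrong
    return np.log(num / den)

# Item 0 is simulated without DIF, so nonzero values reflect bias.
print("studied item included:", mh_log_odds(resp_ref, resp_foc, 0, True))
print("studied item excluded:", mh_log_odds(resp_ref, resp_foc, 0, False))
```

Because the studied item is simulated without DIF, the excluded-item statistic drifting away from zero in one direction across replications would illustrate the systematic bias the study reports.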
Various subscore estimation methods that use auxiliary information to improve subscore accuracy and stability have been developed. This report reviews the subscore estimation methods described in the literature: the methodology of each method is described, and research studies on these methods are then summarized. Comments on the methods and suggestions for future research are provided, along with recommended preliminary guidelines for using subscore estimation methods in practice.
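One widely cited member of this family is Haberman's weighted-average approach, which predicts the true subscore from the observed subscore and the total score. The sketch below implements that regression logic under standard classical test theory assumptions (uncorrelated errors, subscore a component of the total); the simulated scores and the assumed reliability are illustrative, not drawn from any real test.

```python
import numpy as np

def augmented_subscore(sub, total, rel_sub):
    """Haberman-style augmented subscore: best linear predictor of the
    true subscore from the observed subscore and the total score."""
    sub, total = np.asarray(sub, float), np.asarray(total, float)
    var_s, var_x = sub.var(ddof=1), total.var(ddof=1)
    cov_sx = np.cov(sub, total)[0, 1]
    err_s = (1 - rel_sub) * var_s          # subscore error variance
    # Covariances of the true subscore with the two observed predictors:
    # cov(T_s, s) = rel * var(s); cov(T_s, x) = cov(s, x) - err_s because
    # the subscore's error is shared with the total through s.
    c = np.array([rel_sub * var_s, cov_sx - err_s])
    Sigma = np.array([[var_s, cov_sx], [cov_sx, var_x]])
    w = np.linalg.solve(Sigma, c)
    return sub.mean() + w[0] * (sub - sub.mean()) + w[1] * (total - total.mean())

# Illustrative use with simulated scores (true reliability = 16/20 = 0.8):
rng = np.random.default_rng(3)
true_sub = rng.normal(20, 4, 1000)
sub = true_sub + rng.normal(0, 2, 1000)
total = sub + rng.normal(40, 6, 1000)      # total = subscore + other parts
print(augmented_subscore(sub, total, rel_sub=0.8)[:5])
```

The weights shrink each observed subscore toward its mean and borrow strength from the total score, which is the general sense in which these methods use auxiliary information.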
This paper reports on a study investigating the consistency of results between 2 approaches to estimating school effectiveness through value-added modeling. Estimates of school effects from the layered model, which employs item response theory (IRT) scaled data, are compared with estimates derived from a discrete growth model based on the analysis of transitions along an ordinal developmental scale. The data were extracted from the longitudinal records of the Early Childhood Longitudinal Study-Kindergarten Cohort (ECLS-K) archive for students who remained in the same school from the beginning of kindergarten through the end of Grade 3. The comparisons indicated that the estimates from the 2 approaches are moderately consistent.
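The layered model is too involved to reproduce here, but a deliberately simplified random-effects sketch conveys the value-added logic common to both approaches: school effects are estimated as deviations in student growth after conditioning on prior status. Everything below (data, column names, effect sizes) is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated students nested within schools; gains depend on a true
# school effect plus pretest status and noise.
rng = np.random.default_rng(4)
n_schools, n_students = 40, 60
school_eff = rng.normal(0, 2, n_schools)
rows = []
for s in range(n_schools):
    pre = rng.normal(50, 10, n_students)
    gain = 10 + school_eff[s] + 0.1 * (pre - 50) + rng.normal(0, 5, n_students)
    rows.append(pd.DataFrame({"school": s, "pretest": pre, "gain": gain}))
df = pd.concat(rows, ignore_index=True)

# Random intercepts for schools; their predicted values serve as
# simple value-added estimates of school effectiveness.
fit = smf.mixedlm("gain ~ pretest", df, groups=df["school"]).fit()
school_effects = {g: re.iloc[0] for g, re in fit.random_effects.items()}
print(sorted(school_effects.items(), key=lambda kv: kv[1])[:5])  # lowest VA
```

A consistency check of the kind the study performs would then correlate such estimates with those from an alternative model fit to the same students.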