For educational tests, it is critical to maintain consistency of score scales and to understand the sources of variation in score means over time. This practice helps ensure that interpretations of test takers' abilities are comparable from one administration (or one form) to another. This study examines the consistency of reported scores for the TOEIC® Speaking and Writing tests using two statistical procedures. Specifically, the stability of TOEIC Speaking score means from 431 forms administered over a 3-year period was evaluated using harmonic regression, and the stability of TOEIC Writing score means from 66 forms administered over a 3-year period was evaluated using analysis of variance. Results indicated that fluctuations in the TOEIC Speaking and Writing score means mainly reflect changes in test takers' overall English speaking or writing ability rather than score inaccuracies. For both tests, a large proportion of the variation in score means can be explained by seasonality (the rise or fall of score means associated with specific times of the year) and by test takers' demographic characteristics, both of which have been shown to be related to test-taker ability. These findings provide evidence for the consistency of the TOEIC Speaking and Writing score scales across forms.
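As a rough illustration of the harmonic regression idea, the sketch below fits sine and cosine terms with an annual period to simulated form-level score means; the data, the single annual harmonic, and all variable names are assumptions made for illustration, not the study's actual specification.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical form-level data: each form has an administration date
# (day of year) and a mean score. Values are illustrative only.
rng = np.random.default_rng(0)
day_of_year = rng.integers(1, 366, size=431)
score_means = (
    130
    + 3.0 * np.sin(2 * np.pi * day_of_year / 365.25)  # seasonal rise/fall
    + rng.normal(0, 1.5, size=431)                     # form-to-form noise
)

# Harmonic regression: regress score means on sine/cosine terms so the
# fitted curve captures periodic (seasonal) variation in the means.
X = np.column_stack([
    np.sin(2 * np.pi * day_of_year / 365.25),
    np.cos(2 * np.pi * day_of_year / 365.25),
])
fit = sm.OLS(score_means, sm.add_constant(X)).fit()
print(fit.summary())  # R-squared = share of variation explained by seasonality
```

In such a model, a high R-squared indicates that the apparent instability in form means is largely predictable seasonal variation in the examinee population rather than drift in the score scale.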
For a testing program with frequent administrations, it is important to understand and monitor the stability and fluctuation of test performance across administrations. Different methods have been proposed for this purpose. This study explored the potential of multilevel analysis for understanding and monitoring examinees' test performance across administrations based on their background information. Using the test scores and background information of 330,091 examinees collected from 254 administrations of an English speaking test, the study found that (a) at the individual examinee level, examinees' background characteristics had statistically significant relationships with their test performance, and these relationships varied across administrations; however, the prediction of individual test scores from background variables was not strong; and (b) at the administration level, group composition had strong relationships with administration means, and the prediction of administration means from group composition variables was fairly strong. The results suggest that multilevel analysis can be applied to understanding and monitoring test performance across administrations by identifying statistical relationships between examinees' characteristics and their test performance at both the individual and administration levels.
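The sketch below illustrates the kind of two-level analysis described, using simulated data and the statsmodels mixed-model API; the predictor (study_years), the group sizes, and the effect sizes are all hypothetical stand-ins for the study's actual background variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: examinees nested within administrations.
rng = np.random.default_rng(1)
n_admin, n_per = 50, 100
admin_effect = rng.normal(0, 5, n_admin)
rows = []
for a in range(n_admin):
    study_years = rng.normal(4, 1.5, n_per)
    score = 100 + admin_effect[a] + 2.0 * study_years + rng.normal(0, 10, n_per)
    rows.append(pd.DataFrame({"admin_id": a, "study_years": study_years,
                              "score": score}))
df = pd.concat(rows, ignore_index=True)

# Level 1: examinee-level predictor with a random intercept and random
# slope across administrations (the slope captures relationships that
# vary from administration to administration).
ml_fit = smf.mixedlm("score ~ study_years", df, groups=df["admin_id"],
                     re_formula="~study_years").fit()
print(ml_fit.summary())

# Level 2: regress administration means on group composition
# (here, the administration-level mean of the background variable).
admin = df.groupby("admin_id").mean()
print(smf.ols("score ~ study_years", admin).fit().rsquared)
```

The contrast the study reports would show up here as a modest individual-level R-squared but a much stronger fit for the administration-level regression.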
The nature of the matching criterion (usually the total score) in studies of differential item functioning (DIF) has been shown to affect the accuracy of DIF detection procedures. One question related to the matching criterion is whether the studied item should be included in it. Although many studies suggest that the studied item should always be included in the criterion, the validity of this recommendation for models other than the Rasch model has not been examined. This study evaluates the effect of including or excluding the studied item in the matching criterion in situations that mimic real testing conditions, where the assumptions of the Rasch model are violated. A simulation study examined the effect of including or excluding the studied item across different magnitudes of DIF and different group ability distributions, for data generated under two-parameter logistic (2PL) item response theory (IRT) and multidimensional item response theory (MIRT) models. Results show that including the studied item leads to less biased DIF estimates and more appropriate Type I error rates, especially when the group ability distributions differ. When the studied item was excluded from the matching criterion, a systematic bias in DIF estimation favoring the high-ability group was found consistently across all simulated conditions.
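To make the design concrete, the following sketch simulates 2PL response data for two groups that differ in ability and computes a Mantel-Haenszel DIF statistic with the studied item included versus excluded from the matching total. The item parameters and sample sizes are illustrative assumptions, and Mantel-Haenszel stands in here for whichever DIF procedures the study actually compared.

```python
import numpy as np

rng = np.random.default_rng(2)

def sim_2pl(theta, a, b):
    """Simulate dichotomous responses under the 2PL IRT model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

n_items, n_per_group = 30, 2000
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0, 1, n_items)

# Groups differ in ability (impact), the condition under which
# inclusion/exclusion of the studied item matters most.
resp_ref = sim_2pl(rng.normal(0.5, 1, n_per_group), a, b)   # reference group
resp_foc = sim_2pl(rng.normal(-0.5, 1, n_per_group), a, b)  # focal group

def mh_log_odds(resp_r, resp_f, item, include_studied):
    """Log Mantel-Haenszel odds ratio, matching on the total score with or
    without the studied item included in the matching criterion."""
    cols = np.arange(resp_r.shape[1])
    keep = cols if include_studied else cols[cols != item]
    tot_r, tot_f = resp_r[:, keep].sum(1), resp_f[:, keep].sum(1)
    num = den = 0.0
    for s in np.union1d(tot_r, tot_f):
        r, f = resp_r[tot_r == s, item], resp_f[tot_f == s, item]
        n = len(r) + len(f)
        if len(r) == 0 or len(f) == 0:
            continue
        num += r.sum() * (len(f) - f.sum()) / n  # ref right * focal wrong
        den += f.sum() * (len(r) - r.sum()) / n  # focal right * ref wrong
    return np.log(num / den)

# Item 0 is simulated without DIF, so nonzero values reflect bias.
print("studied item included:", mh_log_odds(resp_ref, resp_foc, 0, True))
print("studied item excluded:", mh_log_odds(resp_ref, resp_foc, 0, False))
```

Because the studied item is simulated without DIF, the excluded-item statistic drifting away from zero in one direction across replications would illustrate the systematic bias the study reports.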
Various subscore estimation methods that use auxiliary information to improve subscore accuracy and stability have been developed. This report reviews the subscore estimation methods described in the literature: the methodology of each method is described, and research studies on these methods are then summarized. Comments on the methods and suggestions for future research are provided, along with recommended preliminary guidelines for using subscore estimation methods in practice.
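One widely cited member of this family is Haberman's weighted-average approach, which predicts the true subscore from the observed subscore and the total score. The sketch below implements that regression logic under standard classical test theory assumptions (uncorrelated errors, subscore a component of the total); the simulated scores and the assumed reliability are illustrative, not drawn from any real test.

```python
import numpy as np

def augmented_subscore(sub, total, rel_sub):
    """Haberman-style augmented subscore: best linear predictor of the
    true subscore from the observed subscore and the total score."""
    sub, total = np.asarray(sub, float), np.asarray(total, float)
    var_s, var_x = sub.var(ddof=1), total.var(ddof=1)
    cov_sx = np.cov(sub, total)[0, 1]
    err_s = (1 - rel_sub) * var_s          # subscore error variance
    # Covariances of the true subscore with the two observed predictors:
    # cov(T_s, s) = rel * var(s); cov(T_s, x) = cov(s, x) - err_s because
    # the subscore's error is shared with the total through s.
    c = np.array([rel_sub * var_s, cov_sx - err_s])
    Sigma = np.array([[var_s, cov_sx], [cov_sx, var_x]])
    w = np.linalg.solve(Sigma, c)
    return sub.mean() + w[0] * (sub - sub.mean()) + w[1] * (total - total.mean())

# Illustrative use with simulated scores (true reliability = 16/20 = 0.8):
rng = np.random.default_rng(3)
true_sub = rng.normal(20, 4, 1000)
sub = true_sub + rng.normal(0, 2, 1000)
total = sub + rng.normal(40, 6, 1000)      # total = subscore + other parts
print(augmented_subscore(sub, total, rel_sub=0.8)[:5])
```

The weights shrink each observed subscore toward its mean and borrow strength from the total score, which is the general sense in which these methods use auxiliary information.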
This paper reports on a study investigating the consistency of results between 2 approaches to estimating school effectiveness through value-added modeling. Estimates of school effects from the layered model, which employs item response theory (IRT) scaled data, are compared with estimates derived from a discrete growth model based on the analysis of transitions along an ordinal developmental scale. The data were extracted from the longitudinal records of the Early Childhood Longitudinal Study-Kindergarten Cohort (ECLS-K) archive for students who remained in the same school from the beginning of kindergarten through the end of Grade 3. The comparisons indicated that the estimates from the 2 approaches are moderately consistent.
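The layered model is too involved to reproduce here, but a deliberately simplified random-effects sketch conveys the value-added logic common to both approaches: school effects are estimated as deviations in student growth after conditioning on prior status. Everything below (data, column names, effect sizes) is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated students nested within schools; gains depend on a true
# school effect plus pretest status and noise.
rng = np.random.default_rng(4)
n_schools, n_students = 40, 60
school_eff = rng.normal(0, 2, n_schools)
rows = []
for s in range(n_schools):
    pre = rng.normal(50, 10, n_students)
    gain = 10 + school_eff[s] + 0.1 * (pre - 50) + rng.normal(0, 5, n_students)
    rows.append(pd.DataFrame({"school": s, "pretest": pre, "gain": gain}))
df = pd.concat(rows, ignore_index=True)

# Random intercepts for schools; their predicted values serve as
# simple value-added estimates of school effectiveness.
fit = smf.mixedlm("gain ~ pretest", df, groups=df["school"]).fit()
school_effects = {g: re.iloc[0] for g, re in fit.random_effects.items()}
print(sorted(school_effects.items(), key=lambda kv: kv[1])[:5])  # lowest VA
```

A consistency check of the kind the study performs would then correlate such estimates with those from an alternative model fit to the same students.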