2014
DOI: 10.1002/ets2.12029

Using Multilevel Analysis to Monitor Test Performance Across Administrations

Abstract: For a testing program with frequent administrations, it is important to understand and monitor the stability and fluctuation of test performance across administrations. Different methods have been proposed for this purpose. This study explored the potential of using multilevel analysis to understand and monitor examinees' test performance across administrations based on their background information. Based on the data of 330,091 examinees' test scores and their background information collected from 254 administ…
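
For orientation, a generic two-level random-intercept model of the kind the abstract alludes to (examinees nested within administrations) can be sketched as follows. This is only an illustrative textbook form, not necessarily the specification used in the study; all symbols here ($Y_{ij}$, $X_{ij}$, $\gamma_{00}$, $u_{0j}$, $\sigma^2$, $\tau^2$) are assumptions introduced for illustration.

```latex
% Level 1: examinee i within administration j
Y_{ij} = \beta_{0j} + \beta_{1} X_{ij} + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^{2})

% Level 2: administration-level intercept
\beta_{0j} = \gamma_{00} + u_{0j}, \qquad u_{0j} \sim N(0, \tau^{2})

% Share of score variance lying between administrations (intraclass correlation)
\rho = \frac{\tau^{2}}{\tau^{2} + \sigma^{2}}
```

Under this sketch, $X_{ij}$ stands for an examinee background variable, and a small $\rho$ after conditioning on background variables would suggest that little score variation remains attributable to the administrations themselves.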

Citations: cited by 3 publications (8 citation statements)
References: 15 publications
“…The growth modeling results indicate that (a) the repeaters' fitted score means based on the models were consistent with their observed score means and (b) the growth parameters based on two different samples were very close to each other. This suggests that growth modeling is very promising for predicting repeaters' score means at the group level, which is consistent with findings from other studies (Wei, ; Wei & Qu, ). Therefore we can monitor test performance by comparing repeaters' observed and predicted score means.…”
Section: Discussion
supporting
confidence: 91%
“…Across‐administration test quality control may include the evaluation of the fluctuation of score summary statistics, population composition and background changes, test content evolution and difficulty shift, equating errors and scale drift, and the stability of psychometric properties such as reliability and validity. Various methods and procedures have been proposed for this purpose, such as time series analysis (Li, Li, & von Davier, ), harmonic regression (Lee & Haberman, ), linear mixed effects modeling (Lee, Liu, & von Davier, ), Shewhart control charts (see a brief description in von Davier, ), hidden Markov modeling (Lee & von Davier, ), and multilevel analysis (Wei, ; Wei & Qu, ).…”
mentioning
confidence: 99%
“…However, this 16% of variance often represents what is observed in real testing situations. For example, in a large‐scale English language test, Wei and Qu () observed that the variance of total test score that could be explained by selected background variables varied from 1.8% to 21.2%. PEG linking is expected to perform better and be closer to that of NEAT equating when more variables are included in the scenario of large group difference in ability.…”
Section: Summary and Discussion
mentioning
confidence: 99%
“…For example, Puhan (2008) considered two equating designs, one with parallel chains and the other with a single long chain; for each equating design, scale stability was assessed by comparing the equating functions produced from different chains. The second type of approaches utilizes current and historical data of an assessment in a statistical model to identify possible sources of administration‐to‐administration variability in test scores that may be easily explained, and then assess the unexplained variability in the scores (e.g., Haberman et al., 2008; Lee & Haberman, 2013; Lee & von Davier, 2013; Lee, Liu, & von Davier, 2014; Liu & Yoo, 2019; Qu et al., 2017; Wei & Qu, 2014). Thorough evaluation of the identified sources of variability and the degree and patterns of the unexplained variability can then suggest score (in)stability.…”
Section: Introduction
mentioning
confidence: 99%
“…Among the second type of approaches mentioned above, some analyzed examinee‐level data (e.g., Lee et al., 2014; Wei & Qu, 2014) and others analyzed administration‐level data (e.g., Haberman et al., 2008; Lee & Haberman, 2013; Lee & von Davier, 2013; Liu & Yoo, 2019; Qu et al., 2017). The former approaches studied the effects of examinee‐specific demographic data on individual scores, including a linear mixed‐effects model applied to 15 administrations structured by a special equating design (Lee et al., 2014) or a multilevel analysis applied to 254 administrations of a test in 4 years (Wei & Qu, 2014). Both studies examined many demographic variables but did not consider any types of seasonality in test scores.…”
Section: Introduction
mentioning
confidence: 99%