The purpose of this study was to investigate model-data fit and differential rater functioning in the context of large-group music performance assessment using the Many-Facet Rasch Partial Credit Measurement Model. In particular, we sought to identify whether expert raters' (N = 24) severity was invariant across four school levels (middle school, high school, collegiate, professional). Interaction analyses suggested that differential rater functioning existed both for the group of raters as a whole and for some individual raters, based on their expected locations on the logit scale. This indicates that expert raters did not demonstrate invariant levels of severity when rating subgroups of ensembles across the four school levels. Of the 92 potential pairwise interactions examined, 14 (15.2%) were statistically significant, indicating that 10 individual raters demonstrated differential severity across at least one school level. Meaningful systematic patterns emerged for some raters after investigating individual pairwise interactions. Implications for improving fairness and equity in large-group music performance evaluations are discussed.
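The Many-Facet Rasch Partial Credit Model referenced above is commonly written, following Linacre's formulation, as the log-odds of an ensemble receiving rating category k rather than k−1 on a given item. A sketch of the general form (the symbol names are conventional, not taken from this article):

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_{ik}
```

Here θ_n is the achievement of ensemble n, δ_i the difficulty of item i, λ_j the severity of rater j, and τ_ik the threshold between categories k−1 and k of item i (thresholds vary by item in the partial-credit variant). Differential rater functioning analyses of the kind described above add an interaction term (e.g., rater-by-school-level) to this model and test whether its estimate departs significantly from zero.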
Mokken scale analysis (MSA) is a probabilistic-nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic-nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple-choice physical science assessment and a rater-mediated writing assessment.
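Central to MSA is Loevinger's scalability coefficient H: the ratio of the observed inter-item covariances to the maximum covariances attainable given the item marginals, with H = 1 for a perfect Guttman pattern and H ≥ 0.3 conventionally treated as the minimum for a usable scale. A minimal sketch for the dichotomous case, assuming 0/1 scored data (the function name and data are illustrative; the R `mokken` package is the standard tool for real analyses):

```python
import numpy as np

def loevinger_h(X):
    """Loevinger's H for a set of dichotomous items.

    X: (n_persons, n_items) array of 0/1 scores.
    Returns the ratio of summed observed inter-item covariances to
    summed maximum covariances given the item proportions correct.
    """
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=0)                        # proportion correct per item
    cov = np.cov(X, rowvar=False, bias=True)  # population covariances
    num = den = 0.0
    k = X.shape[1]
    for i in range(k):
        for j in range(i + 1, k):
            lo, hi = min(p[i], p[j]), max(p[i], p[j])
            num += cov[i, j]
            den += lo * (1.0 - hi)            # max covariance given marginals
    return num / den

# Illustrative data: five persons, four items, perfect Guttman ordering
guttman = np.array([[0, 0, 0, 0],
                    [1, 0, 0, 0],
                    [1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [1, 1, 1, 1]])
print(loevinger_h(guttman))  # ≈ 1.0 for a perfect Guttman scale
```

Polytomous MSA generalizes the same idea by counting Guttman errors over item-step pairs rather than item pairs.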
The use of assessments that require rater judgment (i.e., rater-mediated assessments) has become increasingly popular in high-stakes language assessment worldwide. Through a systematic literature review, this study identifies and explores the dominant methods for evaluating rating quality in research on large-scale rater-mediated language assessments. Results from the review of 259 methodological and applied studies reveal an emphasis on inter-rater reliability as evidence of rating quality, an emphasis that persists across methodological and applied studies, across studies that are and are not primarily focused on rating quality, and across multiple language constructs. Additional findings suggest discrepancies between the rating designs used in empirical research and practical concerns in performance assessment systems. Taken together, the findings highlight a reliance on aggregate-level information, not specific to individual raters or to particular facets of an assessment context, as evidence of rating quality in rater-mediated assessments. To inform the interpretation and use of ratings, as well as the improvement of rater-mediated assessment systems, rating quality indices are needed that go beyond group-level indicators of inter-rater reliability and provide diagnostic evidence specific to individual raters, students, and other facets of the assessment system. Such indicators are available through modern measurement approaches, such as Rasch measurement theory and other item response theory models. Implications are discussed as they relate to validity, reliability/precision, and fairness in rater-mediated assessments.