Music performance assessments frequently include panels of raters who evaluate the quality of musical performances using rating scales. For practical reasons, it is often not possible to obtain ratings from every rater on every performance (i.e., a complete rating design). When raters differ in severity and not all raters rate all performances, ratings of musical performances and their resulting classifications (e.g., pass or fail) depend on the “luck of the rater draw.” In this study, we explored the implications of different types of incomplete rating designs for the classification of musical performances in rater-mediated musical performance assessments. We present a procedure that researchers and practitioners can use to adjust student scores for differences in rater severity when incomplete rating designs are used, and we consider the effects of the adjustment procedure across different types of rating designs. Our results suggest that differences in rater severity have large practical consequences for ratings of musical performances that impact individual students and groups of students differently. Furthermore, our findings suggest that it is possible to adjust musical performance ratings for differences in rater severity as long as there are common raters across scoring panels. We consider the implications of our findings as they relate to music assessment research and practice.
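The linking idea behind the common-rater adjustment can be illustrated with a deliberately simplified, mean-based sketch. This is not the study's actual procedure (which would rest on a measurement model), and all raters, panels, and scores below are hypothetical: each panel's severity is estimated from performances it shares with a common rater, and scores are then re-expressed on the common rater's scale.

```python
# Hypothetical illustration of common-rater linking, not the authors' procedure:
# a toy mean-based severity adjustment across two scoring panels.

def severity_offset(panel_scores, common_scores):
    """Average difference between a panel's ratings and a common rater's
    ratings on the same performances (positive = panel is more lenient)."""
    diffs = [p - c for p, c in zip(panel_scores, common_scores)]
    return sum(diffs) / len(diffs)

# Panel A and the common rater both score students 1-3;
# Panel B and the common rater both score students 4-6. (Made-up data.)
panel_a = [4, 5, 3]      # panel A's ratings
panel_b = [2, 3, 2]      # panel B's ratings (a more severe panel)
common_a = [4, 4, 3]     # common rater on panel A's students
common_b = [3, 4, 3]     # common rater on panel B's students

off_a = severity_offset(panel_a, common_a)
off_b = severity_offset(panel_b, common_b)

# Adjusted scores are comparable because both are on the common rater's scale.
adjusted_a = [s - off_a for s in panel_a]
adjusted_b = [s - off_b for s in panel_b]
```

Without a common rater linking the panels, the two offsets cannot be placed on a shared scale, which mirrors the abstract's condition that adjustment is possible only when scoring panels share raters.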