2015
DOI: 10.1177/1029864915589014

Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning

Abstract: The purpose of this study was to investigate model-data fit and differential rater functioning in the context of large group music performance assessment using the Many-Facet Rasch Partial Credit Measurement Model. In particular, we sought to identify whether or not expert raters' (N = 24) severity was invariant across four school levels (middle school, high school, collegiate, professional). Interaction analyses suggested that differential rater functioning existed for both the group of raters and some indivi…
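For reference, the Many-Facet Rasch Partial Credit Model named in the abstract is commonly written as follows (a standard formulation; the study's exact parameterization may differ):

$$\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_{ik}$$

where $\theta_n$ is the latent quality of performance $n$, $\delta_i$ the difficulty of item $i$, $\lambda_j$ the severity of rater $j$, and $\tau_{ik}$ the threshold between categories $k-1$ and $k$ of item $i$. Differential rater functioning is typically tested by adding a rater-by-subgroup interaction term and checking whether its estimate departs significantly from zero.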

Cited by 41 publications (57 citation statements)
References 54 publications
“…This could explain why the agreement was so low between raters 1 and 3, ICC(3, 2) = 0.31. Considering how common it is for expert raters to disagree when rating music performance quality (Wapnick et al., 2005; Wesolowski et al., 2015), a future study should certainly use a scale with published psychometric data.…”
Section: Limitations and Future Directions (mentioning; confidence: 99%)
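The ICC(3, 2) cited above is Shrout and Fleiss's (1979) two-way mixed, consistency, average-measures coefficient. A minimal numpy sketch, using made-up example scores rather than data from the cited study:

import numpy as np

def icc_3k(ratings: np.ndarray) -> float:
    """ICC(3, k): two-way mixed model, consistency, average of k raters
    (Shrout & Fleiss, 1979). `ratings` has shape (n_targets, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)  # between performances
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)  # between raters
    ss_err = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Hypothetical scores: 6 performances rated by 2 judges on a 10-point scale.
scores = np.array([[7, 8], [5, 5], [9, 9], [4, 6], [6, 7], [8, 9]], dtype=float)
print(f"ICC(3,2) = {icc_3k(scores):.2f}")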
“…As a result, the between-subgroup outfit statistics appear to be useful tools for identifying systematic differences in the levels of severity that a rater exercises when assessing various subgroups. In contrast to other popular methods for detecting potential rater biases, such as bias/interaction analyses (e.g., Engelhard, 2008; Goodwin, 2016; Kondo-Brown, 2002; Springer & Bradley, 2018; Wesolowski et al., 2015; Winke et al., 2012), practitioners do not need to make multiple comparisons when interpreting the meaning of rater between-subgroup outfit statistics. In this article, we have argued that practitioners evaluating performance assessments should consider reporting rater between-subgroup outfit statistics in addition to rater total fit statistics when providing evidence of the fairness of those assessments.…”
Section: Results (mentioning; confidence: 99%)
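A rater's outfit statistic is the unweighted mean of squared standardized residuals over that rater's observations; the between-subgroup version simply restricts the sum to one subgroup. In standard Rasch notation (the cited article's exact definition may differ):

$$MS_{jg} = \frac{1}{N_{jg}} \sum_{(n,i) \in g} z_{nij}^{2}, \qquad z_{nij} = \frac{x_{nij} - E[x_{nij}]}{\sqrt{\mathrm{Var}(x_{nij})}}$$

Values near 1.0 indicate ratings consistent with the model; a rater whose outfit is acceptable overall but elevated for one subgroup is a candidate for differential rater functioning, without the pairwise comparisons that bias/interaction analyses require.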
“…Kondo-Brown (2002); Schaefer (2008); Wesolowski et al. (2015); Wind and Engelhard (2012): magnitude of the difference between the levels of severity that individual raters exhibited when assessing the students in the focal and reference subgroups…”
Section: Simulated Data (mentioning; confidence: 99%)
“…Across performance assessment contexts in general, raters' schemata vary in the use of evaluation cues and the cognitive processes on which the scoring is based, causing fundamental validity concerns with the misconception that observed scores are "measures" (Wolfe, 1997). Under quantitative marking schemes, content-expert raters are vulnerable to their own heuristics guided by decision-making processes, causing construct-irrelevant variability in the scoring process (Wesolowski, Wind, & Engelhard, 2015). These concerns also apply to the context of music performance assessment.…”
(mentioning; confidence: 99%)