2015
DOI: 10.1177/1029864915589014

Rater fairness in music performance assessment: Evaluating model-data fit and differential rater functioning

Abstract: The purpose of this study was to investigate model-data fit and differential rater functioning in the context of large group music performance assessment using the Many-Facet Rasch Partial Credit Measurement Model. In particular, we sought to identify whether or not expert raters' (N = 24) severity was invariant across four school levels (middle school, high school, collegiate, professional). Interaction analyses suggested that differential rater functioning existed for both the group of raters and some indivi…
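For reference, the Many-Facet Rasch Partial Credit Model named in the abstract is commonly written as follows (a standard formulation; the study's exact parameterization may differ):

$$\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_{ik}$$

where $\theta_n$ is the latent quality of performance $n$, $\delta_i$ the difficulty of item $i$, $\lambda_j$ the severity of rater $j$, and $\tau_{ik}$ the threshold between categories $k-1$ and $k$ of item $i$. Differential rater functioning is typically tested by adding a rater-by-subgroup interaction term and checking whether its estimate departs significantly from zero.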

Cited by 41 publications (57 citation statements)
References 54 publications
“…This could explain why the agreement was so low between raters 1 and 3, ICC(3, 2) = 0.31. Considering how common it is for expert raters to disagree when rating music performance quality (Wapnick et al., 2005; Wesolowski et al., 2015), a future study should certainly use a scale with published psychometric data.…”
Section: Limitations and Future Directions (mentioning; confidence: 99%)
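The ICC(3, 2) cited above is Shrout and Fleiss's (1979) two-way mixed, consistency, average-measures coefficient. A minimal numpy sketch, using made-up example scores rather than data from the cited study:

import numpy as np

def icc_3k(ratings: np.ndarray) -> float:
    """ICC(3, k): two-way mixed model, consistency, average of k raters
    (Shrout & Fleiss, 1979). `ratings` has shape (n_targets, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)  # between performances
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)  # between raters
    ss_err = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Hypothetical scores: 6 performances rated by 2 judges on a 10-point scale.
scores = np.array([[7, 8], [5, 5], [9, 9], [4, 6], [6, 7], [8, 9]], dtype=float)
print(f"ICC(3,2) = {icc_3k(scores):.2f}")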
“…As a result, the between-subgroup outfit statistics appear to be useful tools for identifying systematic differences in the levels of severity that a rater exercises when assessing various subgroups. In contrast to other popular methods for detecting potential rater biases, such as bias/interaction analyses (e.g., Engelhard, 2008; Goodwin, 2016; Kondo-Brown, 2002; Springer & Bradley, 2018; Wesolowski et al., 2015; Winke et al., 2012), practitioners do not need to make multiple comparisons when interpreting the meaning of rater between-subgroup outfit statistics. In this article, we have argued that practitioners evaluating performance assessments should consider reporting rater between-subgroup outfit statistics in addition to rater total fit statistics when providing evidence of the fairness of those assessments.…”
Section: Results (mentioning; confidence: 99%)
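A rater's outfit statistic is the unweighted mean of squared standardized residuals over that rater's observations; the between-subgroup version simply restricts the sum to one subgroup. In standard Rasch notation (the cited article's exact definition may differ):

$$MS_{jg} = \frac{1}{N_{jg}} \sum_{(n,i) \in g} z_{nij}^{2}, \qquad z_{nij} = \frac{x_{nij} - E[x_{nij}]}{\sqrt{\mathrm{Var}(x_{nij})}}$$

Values near 1.0 indicate ratings consistent with the model; a rater whose outfit is acceptable overall but elevated for one subgroup is a candidate for differential rater functioning, without the pairwise comparisons that bias/interaction analyses require.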
“…Kondo-Brown (2002); Schaefer (2008); Wesolowski et al. (2015); Wind and Engelhard (2012): magnitude of the difference between the levels of severity that individual raters exhibited when assessing the students in the focal and reference subgroups…”
Section: Simulated Data (mentioning; confidence: 99%)
“…Across performance assessment contexts in general, raters' schemata vary in the use of evaluation cues and the cognitive processes on which the scoring is based, causing fundamental validity concerns with the misconception that observed scores are "measures" (Wolfe, 1997). Under quantitative marking schemes, content-expert raters are vulnerable to their own heuristics guided by decision-making processes, causing construct-irrelevant variability in the scoring process (Wesolowski, Wind, & Engelhard, 2015). These concerns also apply to the context of music performance assessment.…”
(mentioning; confidence: 99%)