2009
DOI: 10.1111/j.1745-3984.2009.00088.x
Monitoring Rater Performance Over Time: A Framework for Detecting Differential Accuracy and Differential Scale Category Use

Abstract: In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition examination, employing a multifaceted Rasch approach to determine whether raters exhibited evidence of two types of differential rater functioning over time (i.e., chan…

Cited by 76 publications (86 citation statements); References 7 publications
“…First, only moderate stability of rater effects (r = 0.6) was found across the two monitoring systems, somewhat worryingly suggesting that different impressions of rater performance could be given by the adoption of a particular system. Other studies too have shown instability in rater effects (Baird et al., 2013; Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009), which might be explained by small sample sizes in the monitoring checks.…”
Section: Discussion
confidence: 95%
“…One of our research questions was whether the different quality assurance systems would produce the same rank-ordering of raters' accuracy, as researchers have recently been interested in the stability of measures of rater effects over time or over subjects (e.g., Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009). The data included 22 examinations, with different styles of scoring rubric.…”
Section: Introduction
confidence: 99%
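The rank-ordering question in the excerpt above can be made concrete with a small sketch: if each monitoring system yields a severity estimate per rater, a Spearman rank correlation shows how far the two systems agree on who is harsh and who is lenient. The severity values and the pure-Python helpers below are invented for illustration, not taken from the cited studies.

```python
def rank(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical severity estimates (logits) for six raters under two
# monitoring systems -- invented data for the sketch.
system_a = [0.42, -0.10, 0.85, 0.03, -0.55, 0.30]
system_b = [0.38, 0.05, 0.60, -0.20, -0.40, 0.55]
print(round(spearman(system_a, system_b), 2))  # → 0.89
```

A correlation near 1 would mean the two systems rank the raters almost identically; a value around 0.6, as reported in the excerpt, would flag only moderate agreement.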
“…(e.g., Eckes, 2012, 2015; Engelhard, 1994; Lumley, 2002; Myford & Wolfe, 2009; Weigle, 1998) or assessment related to text-linguistic features (Freedman, 1979; Fritz & Ruegg, 2013; Wolfe, Song, & Jiao, 2016; Xie, 2015; Östlund-Stjärnegårdh, 2002). Our study has a different starting point: when writing tasks are genre-free, a precondition for the ratings not to diverge is that raters approach the texts with the same way of reading.…”
Section: GB Skar and AJ Aasen
“…Previous studies indicate that raters' severity or leniency can influence the rating process, but their findings vary [2][3]. With advances in personnel measurement, researchers have turned to modern psychometrics to identify the root causes of errors and to make adjustments that accommodate different types of errors.…”
Section: Introduction
confidence: 99%