Published: 2000
DOI: 10.1111/j.1745-3984.2000.tb01081.x
The Stability of Rater Severity in Large‐Scale Assessment Programs

Abstract: The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily …

Cited by 84 publications (87 citation statements)
References 18 publications
“…First, only moderate stability of rater effects (r = 0.6) was found across the two monitoring systems, somewhat worryingly suggesting that different impressions of rater performance could be given by the adoption of a particular system. Other studies too have shown instability in rater effects (Baird et al., 2013; Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009), which might be explained by small sample sizes in the monitoring checks.…”
Section: Discussion
confidence: 94%
“…One of our research questions was whether the different quality assurance systems would produce the same rank-ordering of raters' accuracy, as researchers have recently been interested in the stability of measures of rater effects over time or over subjects (e.g., Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009). The data included 22 examinations, with different styles of scoring rubric.…”
Section: Introduction
confidence: 99%
“…4 As noted above, rater training tends to emphasize agreement (Congdon & McQueen, 2000; Quellmalz, 1985), which from the perspective of SDT involves in part an attempt to get raters to use similar response criteria; research has shown, however, that this is difficult to do, even with extensive training. Moreover, because the criteria have little effect on classification accuracy, the focus on agreement is in some ways misdirected.…”
Section: Some Notes on Rater Training
confidence: 99%
“…For example, in the context of performance assessment in education, Congdon and McQueen (2000) noted that “Presumably, the rater training which is a common feature of rating programs is in part intended to maximize inter-rater agreement. However, even extensive training has little effect on the standards maintained by raters…” (p. 164).…”
Section: Some Notes on Rater Training
confidence: 99%
“…Congdon & McQueen, 2000; Lim, 2011; Moser, Sudweeks, Morrison, & Wilcox, 2014), or the focus is on methods for identifying and/or remedying a lack of rater agreement (e.g., Attali, 2014; Güler, 2014).…”
Section: Att mäta bedömarreliabilitet (Measuring rater reliability)
confidence: unclassified