Published: 2000
DOI: 10.1111/j.1745-3984.2000.tb01081.x
The Stability of Rater Severity in Large‐Scale Assessment Programs

Abstract: The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily …

Cited by 84 publications (87 citation statements)
References 18 publications
“…First, only moderate stability of rater effects (r = 0.6) was found across the two monitoring systems, somewhat worryingly suggesting that different impressions of rater performance could be given by the adoption of a particular system. Other studies too have shown instability in rater effects (Baird et al., 2013; Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009), which might be explained by small sample sizes in the monitoring checks.…”
Section: Discussion
confidence: 94%
“…One of our research questions was whether the different quality assurance systems would produce the same rank-ordering of raters' accuracy, as researchers have recently been interested in the stability of measures of rater effects over time or over subjects (e.g., Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009). The data included 22 examinations, with different styles of scoring rubric.…”
Section: Introduction
confidence: 99%
“…4 As noted above, rater training tends to emphasize agreement (Congdon & McQueen, 2000; Quellmalz, 1985), which from the perspective of SDT involves in part an attempt to get raters to use similar response criteria; research has shown, however, that this is difficult to do, even with extensive training. Moreover, because the criteria have little effect on classification accuracy, the focus on agreement is in some ways misdirected.…”
Section: Some Notes on Rater Training
confidence: 99%
“…For example, in the context of performance assessment in education, Congdon and McQueen (2000) noted that “Presumably, the rater training which is a common feature of rating programs is in part intended to maximize inter-rater agreement. However, even extensive training has little effect on the standards maintained by raters…” (p. 164).…”
Section: Some Notes on Rater Training
confidence: 99%
“…Congdon & McQueen, 2000; Lim, 2011; Moser, Sudweeks, Morrison, & Wilcox, 2014), or the focus is on methods for identifying and/or remedying a lack of rater agreement (e.g., Attali, 2014; Güler, 2014).…”
Section: Att mäta bedömarreliabilitet (Measuring rater reliability)
confidence: unclassified