2009
DOI: 10.1111/j.1745-3984.2009.00088.x
Monitoring Rater Performance Over Time: A Framework for Detecting Differential Accuracy and Differential Scale Category Use

Abstract: In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition examination, employing a multifaceted Rasch approach to determine whether raters exhibited evidence of two types of differential rater functioning over time (i.e., chan…

Cited by 76 publications (86 citation statements); References 7 publications
“…First, only moderate stability of rater effects (r = 0.6) was found across the two monitoring systems, somewhat worryingly suggesting that different impressions of rater performance could be given by the adoption of a particular system. Other studies too have shown instability in rater effects (Baird et al., 2013; Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009), which might be explained by small sample sizes in the monitoring checks.…”
Section: Discussion
confidence: 95%
“…One of our research questions was whether the different quality assurance systems would produce the same rank-ordering of raters' accuracy, as researchers have recently been interested in the stability of measures of rater effects over time or over subjects (e.g., Congdon & McQueen, 2000; Hoskens & Wilson, 2001; Harik et al., 2009; Lamprianou, 2006; Myford & Wolfe, 2009). The data included 22 examinations, with different styles of scoring rubric.…”
Section: Introduction
confidence: 99%
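The rank-ordering question in the excerpt above can be made concrete with a small sketch: if each monitoring system yields a severity estimate per rater, a Spearman rank correlation shows how far the two systems agree on who is harsh and who is lenient. The severity values and the pure-Python helpers below are invented for illustration, not taken from the cited studies.

```python
def rank(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical severity estimates (logits) for six raters under two
# monitoring systems -- invented data for the sketch.
system_a = [0.42, -0.10, 0.85, 0.03, -0.55, 0.30]
system_b = [0.38, 0.05, 0.60, -0.20, -0.40, 0.55]
print(round(spearman(system_a, system_b), 2))  # → 0.89
```

A correlation near 1 would mean the two systems rank the raters almost identically; a value around 0.6, as reported in the excerpt, would flag only moderate agreement.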
“…(e.g., Eckes, 2012, 2015; Engelhard, 1994; Lumley, 2002; Myford & Wolfe, 2009; Weigle, 1998) or assessment related to text-linguistic features (Freedman, 1979; Fritz & Ruegg, 2013; Wolfe, Song, & Jiao, 2016; Xie, 2015; Östlund-Stjärnegårdh, 2002). Our study has a different starting point: when writing tasks are genre-free, a precondition for the ratings not to diverge is that raters approach the texts with the same way of reading.…”
Section: GB Skar and AJ Aasen
“…Previous studies indicate that raters' severity or leniency can influence the rating process, but their findings vary [2][3]. With advances in personnel measurement, researchers have turned to modern psychometrics to identify the root causes of errors and to make adjustments that accommodate different types of errors.…”
Section: Introduction
confidence: 99%