1992
DOI: 10.1177/014662169201600109

Estimating Individual Rater Reliabilities

Abstract: Rating scales have no inherent reliability that is independent of the observers who use them. The often reported interrater reliability is an average of perhaps quite different individual rater reliabilities. It is possible to separate out the individual rater reliabilities given a number of independent raters who observe the same sample of ratees. Under certain assumptions, an external measure can replace one of the raters, and individual reliabilities of two independent raters can be estimated. In a somewhat…

Cited by 10 publications (5 citation statements); references 15 publications. Citing publications span 1995 to 2022.
“…Van den Bergh & Eiting (1989) assumed multiple quantitative ratings to be congeneric, tau-equivalent, or parallel and then used LISREL (Jöreskog & Sörbom, 1988) to fit these models. Overall & Magee (1992) proposed several simple models, such as the disattenuation model, the common factor model, the external criterion model, the treatment effects model, and the regression model, to estimate individual reliabilities of raters from simple bivariate correlations among their ratings. Item response modeling focuses on rater severity as an important aspect of rater consistency that needs to be examined.…”
Section: Three Issues
confidence: 99%
“…However, research has shown that substantial construct-irrelevant variance is introduced into essay scores as a consequence of the rating process alone (Congdon and McQueen, 2000). Even if the rating rubric has been constructed carefully, the reliability and validity of the rating process still depends mainly on the implementation of the rating activities (Overall and Magee, 1992). Because of variations in both the characteristics and status of raters, together with fluctuations between various rating environments, individual raters struggle to remain consistent across multiple rating processes, and different raters may assess the same samples differently.…”
Section: Introduction
confidence: 99%
“…Prominent among these sources is the variance associated with raters. This is a reflection of the concern that, no matter how carefully constructed, the reliability of a rating scale is critically dependent on the raters who operate it (Overall & Magee, 1992). As Dunbar, Koretz, and Hoover (1991) put it, "fallible raters can wreak havoc on the trustworthiness of scores and add a term to the reliability equation that does not exist in the tests that can be scored objectively."…”
confidence: 99%