Educational assessment often involves students' constructed work (e.g., artistic performances, essays, and science projects), which requires raters to make subjective scoring judgments. Many other settings, such as job interviews, performance appraisals, and beauty, singing, or sports contests, also involve raters. To reduce rater error and increase reliability, an item response or performance is often graded by multiple raters in a process referred to as "multiple ratings." There are several approaches to the analysis of multiple ratings. Within the framework of generalizability theory (Brennan, 2001), the total variance is partitioned into variance components associated with person, item, rater, and their two-way and three-way interactions. Although generalizability theory is useful, it is limited by specifying a linear mathematical relationship among the components, which is not appropriate for categorical item responses. Nevertheless, the variance decomposition into person, item, and rater effects is helpful in formulating more complicated models for rater effects. We focus on item response theory (IRT) models for rater effects in this study.
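The variance decomposition can be illustrated with a small simulation. The sketch below assumes a fully crossed person x item x rater design with additive random effects only (no interaction terms, which the residual then absorbs) and estimates the variance components from ANOVA mean squares; the sample sizes and effect standard deviations are illustrative choices, not values from this study.

```python
import numpy as np

# Illustrative design: fully crossed persons x items x raters (assumed sizes)
rng = np.random.default_rng(0)
n_p, n_i, n_r = 200, 20, 10
sd_p, sd_i, sd_r, sd_e = 1.0, 0.5, 0.3, 0.8  # assumed true effect SDs

# Simulate an additive random-effects score for each (person, item, rater) cell
P = rng.normal(0, sd_p, n_p)
I = rng.normal(0, sd_i, n_i)
R = rng.normal(0, sd_r, n_r)
E = rng.normal(0, sd_e, (n_p, n_i, n_r))
Y = P[:, None, None] + I[None, :, None] + R[None, None, :] + E

# Sums of squares for each main effect; the residual picks up everything else
grand = Y.mean()
ss_p = n_i * n_r * ((Y.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_i = n_p * n_r * ((Y.mean(axis=(0, 2)) - grand) ** 2).sum()
ss_r = n_p * n_i * ((Y.mean(axis=(0, 1)) - grand) ** 2).sum()
ss_e = ((Y - grand) ** 2).sum() - ss_p - ss_i - ss_r

# Mean squares and method-of-moments variance-component estimates
N = n_p * n_i * n_r
ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_r = ss_r / (n_r - 1)
ms_e = ss_e / (N - 1 - (n_p - 1) - (n_i - 1) - (n_r - 1))

var_p = (ms_p - ms_e) / (n_i * n_r)   # person variance component
var_i = (ms_i - ms_e) / (n_p * n_r)   # item variance component
var_r = (ms_r - ms_e) / (n_p * n_i)   # rater variance component

print(f"person: {var_p:.3f}  item: {var_i:.3f}  rater: {var_r:.3f}  residual: {ms_e:.3f}")
```

With these settings the estimates should fall near the true values of 1.00, 0.25, 0.09, and 0.64; because only 10 raters are simulated, the rater component is estimated with considerable sampling error, which is one practical reason multiple-rating designs need care.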
Facets Models

The facets model (Linacre, 1989) and the hierarchical rater model (HRM; Patz, Junker, Johnson, & Mariano, 2002) are two popular IRT models for rater effects.