Based on evidence that listeners may favor certain foreign accents over others (Gass & Varonis, 1984; Major, Fitzmaurice, Bunta, & Balasubramanian, 2002; Tauroza & Luk, 1997) and that language-test raters may better comprehend and/or rate the speech of test takers whose native languages (L1s) are more familiar on some level (Carey, Mannell, & Dunn, 2011; Fayer & Krasinski, 1987; Scales, Wennerstrom, Richard, & Wu, 2006), we investigated whether accent familiarity (defined as having learned the test takers’ L1) leads to rater bias. We examined 107 raters’ ratings of 432 TOEFL iBT™ speech samples from 72 test takers. The raters of interest were L2 speakers of Spanish, Chinese, or Korean, while the test takers comprised three L1 groups (24 each): Spanish, Chinese, and Korean. We analyzed the ratings using a multifaceted Rasch measurement approach. Results indicated that L2 Spanish raters were significantly more lenient with L1 Spanish test takers, as were L2 Chinese raters with L1 Chinese test takers. We conclude by concurring with Xi and Mollaun (2009, 2011) and Carey et al. (2011) that rater training should address raters’ linguistic background as a potential rater effect. Furthermore, we discuss the importance of recognizing rater L2 as a possible source of bias.
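For readers unfamiliar with the technique, a minimal sketch of the many-facet Rasch model extended with a bias (interaction) term is given below; the notation is illustrative, not the original study's.

    \log \frac{P_{njik}}{P_{nji(k-1)}} = \theta_n - \alpha_j - \delta_i - \tau_k + \phi_{g(n)j}

Here \theta_n is test taker n's proficiency, \alpha_j is rater j's severity, \delta_i is task difficulty, \tau_k is the threshold for scale category k, and \phi_{g(n)j} is the bias term for rater j interacting with test takers from L1 group g(n). A significantly positive \phi indicates leniency toward that group, the pattern reported above for the matched L2 Spanish and L2 Chinese raters.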
In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition examination, employing a multifaceted Rasch approach to determine whether raters exhibited evidence of two types of differential rater functioning over time (i.e., changes in levels of accuracy or scale category use). Some raters showed statistically significant changes in their levels of accuracy as the scoring progressed, while other raters displayed evidence of differential scale category use over time.
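Drift of this kind is commonly operationalized, sketched here in our own notation rather than the authors', by adding a rating-period facet and a rater-by-period interaction to the model:

    \log \frac{P_{njtk}}{P_{njt(k-1)}} = \theta_n - \alpha_j - \gamma_t - \tau_k + \phi_{jt}

where \gamma_t is the overall severity of rating period t and \phi_{jt} is rater j's departure from his or her average severity during that period; statistically significant \phi_{jt} estimates flag differential rater functioning over time.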
The purpose of this study was to examine, describe, evaluate, and compare the rating behavior of faculty consultants who scored essays written for the Advanced Placement English Literature and Composition (AP® ELC) Exam. Data from the 1999 AP ELC Exam were analyzed using FACETS (Linacre, 1998) and SAS. The faculty consultants were not all interchangeable in terms of the level of severity they exercised. If students' ratings had been adjusted for severity differences, the AP grades of about 30 percent of the students would have been different from the grades they received, although almost all the differences were one grade or less. Adjusting ratings for faculty consultant severity differences would not have affected some student subgroups more than others.
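As a rough illustration of why such adjustment can move grades, the toy Python sketch below adds each rater's estimated severity back onto the scores that rater gave; the linear correction, the data, and every name in it are invented stand-ins for the model-based adjustment a FACETS analysis would actually produce.

    # Toy severity adjustment (invented data; not the study's procedure).
    # Essays are scored on the AP 1-9 scale by a single rater each.
    ratings = {                 # essay_id -> (rater_id, raw score)
        "e1": ("r1", 6),
        "e2": ("r1", 4),
        "e3": ("r2", 7),
        "e4": ("r2", 5),
    }

    # Severity: points harsher (+) or more lenient (-) than the average rater.
    # A real analysis would estimate these with a many-facet Rasch model.
    severity = {"r1": +0.8, "r2": -0.5}

    def adjusted(essay_id: str) -> float:
        """Severity-corrected score: give back what a harsh rater withheld."""
        rater, raw = ratings[essay_id]
        return raw + severity[rater]

    for essay in ratings:
        rater, raw = ratings[essay]
        print(f"{essay}: rater={rater} raw={raw} adjusted={adjusted(essay):.1f}")

Even a correction smaller than one point can change a final AP grade when a student sits near a grade boundary, which is consistent with the finding that almost all grade changes were one grade or less.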
An Objective Structured Clinical Examination (OSCE) is an effective method for evaluating competencies. However, scores obtained from an OSCE are vulnerable to many potential measurement errors that cases, items, or standardized patients (SPs) can introduce. Monitoring these sources of error is an important quality control mechanism for ensuring valid interpretations of the scores. We describe how one can use generalizability theory (GT) and many-faceted Rasch measurement (MFRM) approaches in quality control monitoring of an OSCE. We examined the communication-skills OSCE of 79 residents from one Midwestern university in the United States. Each resident performed six communication tasks with SPs, who rated the performance of each resident using 18 five-category rating-scale items. We analyzed the ratings with generalizability and MFRM studies. The generalizability study revealed that the largest source of error variance, besides the residual error variance, was SPs/cases. The MFRM study identified specific SPs/cases and items that introduced measurement error and suggested the nature of the errors. SPs/cases differed significantly in their levels of severity/difficulty. Two SPs gave inconsistent ratings, suggesting problems related to the ways they portrayed the case, their understanding of the rating scale, and/or the case content. SPs interpreted two of the items inconsistently, and the rating scales for two items did not function as five-category scales. We concluded that generalizability and MFRM analyses provided useful complementary information for monitoring and improving the quality of an OSCE.
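For orientation, the variance decomposition for a fully crossed persons × cases × items (p × c × i) design is sketched below; the study's actual design confounds SPs with cases, so treat this as an approximation in our notation, not the authors' exact model.

    \sigma^2(X_{pci}) = \sigma^2(p) + \sigma^2(c) + \sigma^2(i) + \sigma^2(pc) + \sigma^2(pi) + \sigma^2(ci) + \sigma^2(pci, e)

The dependability coefficient for absolute decisions over n_c cases and n_i items is then

    \Phi = \frac{\sigma^2(p)}{\sigma^2(p) + \frac{\sigma^2(c) + \sigma^2(pc)}{n_c} + \frac{\sigma^2(i) + \sigma^2(pi)}{n_i} + \frac{\sigma^2(ci) + \sigma^2(pci, e)}{n_c n_i}}

A large SP/case component of the kind reported above depresses \Phi unless more cases are sampled.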
The analyses highlighted the fact that quality control monitoring is essential to ensure fairness when ranking candidates according to scores obtained in a multiple mini-interview (MMI). The results can be used to identify examiners who need further training or who should not be invited back, as well as stations needing review. "Fair average" scores should be used for ranking the candidates.
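The "fair average" invoked here is, in many-facet Rasch terms, a candidate's expected raw score when every other facet is held at its mean level; a sketch of the definition, again in notation of our choosing:

    \text{FairAvg}_n = \sum_{k=0}^{K} k \, P_k(\theta_n \mid \bar{\alpha}, \bar{\delta})

where P_k is the model probability of category k given candidate ability \theta_n, with examiner severity and station difficulty fixed at their averages \bar{\alpha} and \bar{\delta}. Ranking on fair averages rather than raw means removes the luck of which examiners and stations a candidate happened to draw.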