2014
DOI: 10.1177/0265532214536171

An examination of rater performance on a local oral English proficiency test: A mixed-methods approach

Abstract: This paper reports on a mixed-methods approach to evaluate rater performance on a local oral English proficiency test. Three types of reliability estimates were reported to examine rater performance from different perspectives. Quantitative results were also triangulated with qualitative rater comments to arrive at a more representative picture of rater performance and to inform rater training. Specifically, both quantitative (6338 valid rating scores) and qualitative data (506 sets of rater comments) were ana…

Cited by 37 publications (40 citation statements)
References 34 publications
“…Table also reports the outfit and infit indexes, which concern how well raters’ ratings fit the Rasch model or can be predicted by it (Bachman, ; Bond & Fox, ; Eckes, ). In assessing infit mean square and outfit mean square, although both dimensions are ideally expected to be 1.00 (Bond & Fox, ), Linacre () suggests reasonable fit can fall in between .50 and 1.50, which is widely applied in previous studies (e.g., He et al., ; Saeidi et al., ; Yan, ) as the criterion for evaluating both fit statistics.…”
Section: Results (mentioning)
confidence: 99%
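The fit criterion quoted above can be illustrated with a minimal sketch: flag any rater whose infit or outfit mean square falls outside the .50 to 1.50 range. The helper name `within_fit_range` and the rater values are hypothetical, made up for illustration; they are not data from the cited studies, and real fit statistics would come from a Rasch analysis program rather than from this snippet.

```python
# Hypothetical helper: the default bounds follow the .50-1.50 range
# quoted in the citation statement above.
def within_fit_range(mean_square, lower=0.50, upper=1.50):
    """Return True if a fit mean square lies within [lower, upper]."""
    return lower <= mean_square <= upper


# Made-up (infit, outfit) mean squares, for illustration only.
raters = {
    "rater_A": (0.95, 1.02),
    "rater_B": (1.62, 1.48),
    "rater_C": (0.44, 0.71),
}

# A rater is flagged when either statistic falls outside the range.
flagged = sorted(
    name
    for name, (infit, outfit) in raters.items()
    if not (within_fit_range(infit) and within_fit_range(outfit))
)
print(flagged)
```

Checking both statistics reflects the quoted statement that the same range serves as the criterion for evaluating both fit statistics.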
“…[Figure caption: Vertical rulers for rater severity and metaphoric topic.] … fall in between .50 and 1.50, which is widely applied in previous studies (e.g., He et al., 2013; Saeidi et al., 2013; Yan, 2014) as the criterion for evaluating both fit statistics.…”
Section: Cross-cultural Differences In Rating Severity (mentioning)
confidence: 99%
“…As was the case with writing, research has examined rater bias towards certain scale criteria and groups of test takers from the same language background (e.g. Yan, 2014), without examining possible causes of such biases. Other studies have focused on examining specific rater background variables and have investigated whether these can be attributed to certain biases in a dataset.…”
Section: Rater Effects (mentioning)
confidence: 99%
“…However, given the plurality and diversity of each individual examinee, a single reference cannot represent all the characteristics of their oral production. Recent research on oral testing has paid increasing attention to the agreement among the scores given by individual raters (Kim 2009; Yan 2014). In this work we carry out a discourse analysis using complex networks on the oral test of the EEE-4 exam (Examen del Español como Especialidad - Nivel 4), seeking a new methodology to help score oral production at different levels, which is proposed as a potential aid for raters.…”
unclassified