Deygers & Van Gorp (2015)
DOI: 10.1177/0265532215575626

Determining the scoring validity of a co-constructed CEFR-based rating scale

Abstract: Considering scoring validity as encompassing both reliable rating scale use and valid descriptor interpretation, this study reports on the validation of a CEFR-based scale that was co-constructed and used by novice raters. The research questions this paper addresses are (a) whether it is possible to construct, with novice raters, a CEFR-based rating scale that yields reliable ratings, and (b) whether such a scale allows for a uniform interpretation of its descriptors. Additionally, this study focuses on the question whether co…

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

1
10
0
1

Year Published

2016
2016
2023
2023

Publication Types

Select...
5
3
1

Relationship

2
7

Authors

Journals

citations
Cited by 28 publications
(15 citation statements)
references
References 32 publications
1
10
0
1
Order By: Relevance
“…For example, Ling, Mollaun, and Xi’s (2014, p. 479) assertion that “Scoring quality is critical to the validity and fairness of a test” makes a connection between rating, validity, and fairness for tests with constructed responses. Deygers and Van Gorp (2015) make a similar point by drawing upon Weir’s concept of “scoring validity” (Weir, 2005, p. 24), which is seen as a type of validity that subsumes both reliability and validity. Based on their understanding of Weir’s reference to scoring validity, Deygers and Van Gorp (2015) assert, “One aspect of scoring validity is rater reliability, that is, the extent to which raters are consistent with their own and with other raters’ rating” (p. 523).…”
Section: Rating Processes and Validation
confidence: 99%
“…Weir’s approach of adding scoring validity alongside what he termed “traditional validities” (2005, p. 24) highlights the importance of rating issues by combining them within scoring validity.…”
Section: Rating Processes and Validation
confidence: 99%
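
To make the quoted notion of rater reliability concrete: intra-rater consistency asks whether a rater agrees with their own earlier ratings of the same performances, while inter-rater consistency asks whether two raters agree with each other. The Python sketch below is purely illustrative, using invented band scores and two conventional agreement indices (Pearson correlation and quadratically weighted kappa); it does not reproduce the statistics reported by Deygers and Van Gorp (2015).

```python
# Illustrative only: invented scores on a 0-5 band scale for ten scripts.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a_first  = np.array([3, 4, 2, 5, 3, 4, 1, 2, 4, 3])  # rater A, first pass
rater_a_second = np.array([3, 4, 3, 5, 3, 4, 1, 2, 4, 2])  # rater A, re-rating the same scripts
rater_b        = np.array([2, 4, 2, 4, 3, 3, 1, 2, 3, 3])  # rater B, same scripts

# Intra-rater consistency: does rater A agree with their own earlier ratings?
intra_r = np.corrcoef(rater_a_first, rater_a_second)[0, 1]

# Inter-rater consistency: does rater A agree with rater B?
inter_r = np.corrcoef(rater_a_first, rater_b)[0, 1]
inter_kappa = cohen_kappa_score(rater_a_first, rater_b, weights="quadratic")

print(f"intra-rater r = {intra_r:.2f}")
print(f"inter-rater r = {inter_r:.2f}, weighted kappa = {inter_kappa:.2f}")
```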
“…His conceptualization of measurement-driven rating scale construction has highlighted the shortcomings of certain types of level descriptors, has unearthed mismatches between rating criteria and real-world TLU characteristics, and has stressed the need for empirically founded criteria (but see also Alderson, 2007; Jacoby & McNamara, 1999). However, various publications in the field of language testing have shown that a dichotomous rating scale typology (i.e., measurement-driven vs. performance-driven) may not correspond to actual practice, as many rating scales emerge from a variety of sources, including expert input, empirical performance data, and existing language proficiency frameworks such as the CEFR (Deygers & Van Gorp, 2015; Galaczi et al., 2011; Harsch & Martin, 2012; Knoch, 2009). Even the CEFR (Council of Europe, 2001), criticized by Fulcher (2004, 2012) as an example of statistically driven, intuition-based design, describes rating scale development as the process of combining intuitive, qualitative, and quantitative methods.…”
Section: Literature Review
confidence: 99%
“…Nevertheless, even when the rating reliability indices are high, and even when MFRM analyses are applied methodically and rigorously, there are no guarantees that the raters will interpret the same criteria similarly. In fact, empirical studies suggest that the interpretation of a rating scale is fundamentally impacted by rater experience, task types, surface elements, and rater intuition (Lumley, 2002; Barkaoui, 2010; Fulcher et al., 2011; Isaacs & Thomson, 2013), which, in turn, raises important issues regarding scoring validity (Harsch & Martin, 2013; Deygers & Van Gorp, 2015).…”
Section: A Broader Conception of Fairness
confidence: 99%
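
The quoted caveat, that high reliability indices and rigorous MFRM analyses do not guarantee a shared interpretation of the criteria, can be illustrated with a hypothetical example. In the sketch below (invented data), two raters produce perfectly correlated scores, so a correlation-based reliability index is maximal, yet one rater consistently interprets the scale one band more severely. Modelling exactly this kind of rater-severity effect is what many-facet Rasch measurement (MFRM) is used for; the sketch only exposes the symptom and does not implement an MFRM model.

```python
# Hypothetical data: rater B applies the same rank ordering as rater A
# but interprets the scale one band more severely.
import numpy as np

rater_a = np.array([5, 4, 4, 3, 3, 2, 5, 4, 3, 2])
rater_b = rater_a - 1  # identical ordering, systematically harsher

r = np.corrcoef(rater_a, rater_b)[0, 1]
severity_gap = rater_a.mean() - rater_b.mean()

print(f"correlation = {r:.2f}")                         # 1.00: a 'perfect' reliability index
print(f"mean severity gap = {severity_gap:.1f} bands")  # yet every score differs by a full band
```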