Deygers & Van Gorp (2015)
DOI: 10.1177/0265532215575626

Determining the scoring validity of a co-constructed CEFR-based rating scale

Abstract: Considering scoring validity as encompassing both reliable rating scale use and valid descriptor interpretation, this study reports on the validation of a CEFR-based scale that was co-constructed and used by novice raters. The research questions this paper addresses are (a) whether it is possible to construct, with novice raters, a CEFR-based rating scale that yields reliable ratings, and (b) whether such a scale allows for a uniform interpretation of its descriptors. Additionally, this study focuses on the question whether co…

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

1
10
0
1

Year Published

2016
2016
2023
2023

Publication Types

Select...
5
3
1

Relationship

2
7

Authors

Journals

citations
Cited by 28 publications
(15 citation statements)
references
References 32 publications
1
10
0
1
Order By: Relevance
“…For example, Ling, Mollaun, and Xi’s (2014, p. 479) assertion that “Scoring quality is critical to the validity and fairness of a test” makes a connection between rating, validity, and fairness for tests with constructed responses. Deygers and Van Gorp (2015) make a similar point by drawing upon Weir’s concept of “scoring validity” (Weir, 2005, p. 24), which is seen as a type of validity that subsumes both reliability and validity. Based on their understanding of Weir’s reference to scoring validity, Deygers and Van Gorp (2015) assert, “One aspect of scoring validity is rater reliability, that is, the extent to which raters are consistent with their own and with other raters’ rating” (p. 523).…”
Section: Rating Processes and Validation
confidence: 99%
“…Weir’s approach of adding scoring validity alongside what he termed “traditional validities” (2005, p. 24) highlights the importance of rating issues by combining them within scoring validity.…”
Section: Rating Processes and Validation
confidence: 99%
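
To make the quoted notion of rater reliability concrete: intra-rater consistency asks whether a rater agrees with their own earlier ratings of the same performances, while inter-rater consistency asks whether two raters agree with each other. The Python sketch below is purely illustrative, using invented band scores and two conventional agreement indices (Pearson correlation and quadratically weighted kappa); it does not reproduce the statistics reported by Deygers and Van Gorp (2015).

```python
# Illustrative only: invented scores on a 0-5 band scale for ten scripts.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a_first  = np.array([3, 4, 2, 5, 3, 4, 1, 2, 4, 3])  # rater A, first pass
rater_a_second = np.array([3, 4, 3, 5, 3, 4, 1, 2, 4, 2])  # rater A, re-rating the same scripts
rater_b        = np.array([2, 4, 2, 4, 3, 3, 1, 2, 3, 3])  # rater B, same scripts

# Intra-rater consistency: does rater A agree with their own earlier ratings?
intra_r = np.corrcoef(rater_a_first, rater_a_second)[0, 1]

# Inter-rater consistency: does rater A agree with rater B?
inter_r = np.corrcoef(rater_a_first, rater_b)[0, 1]
inter_kappa = cohen_kappa_score(rater_a_first, rater_b, weights="quadratic")

print(f"intra-rater r = {intra_r:.2f}")
print(f"inter-rater r = {inter_r:.2f}, weighted kappa = {inter_kappa:.2f}")
```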
“…His conceptualization of measurement-driven rating scale construction has highlighted the shortcomings of certain types of level descriptors, has unearthed mismatches between rating criteria and real-world TLU characteristics, and has stressed the need for empirically founded criteria (but see also Alderson, 2007; Jacoby & McNamara, 1999). However, various publications in the field of language testing have shown that a dichotomous rating scale typology (i.e., measurement-driven vs. performance-driven) may not correspond to actual practice, as many rating scales emerge from a variety of sources, including expert input, empirical performance data, and existing language proficiency frameworks such as the CEFR (Deygers & Van Gorp, 2015; Galaczi et al., 2011; Harsch & Martin, 2012; Knoch, 2009). Even the CEFR (Council of Europe, 2001), criticized by Fulcher (2004, 2012) as an example of statistically driven, intuition-based design, describes rating scale development as the process of combining intuitive, qualitative, and quantitative methods.…”
Section: Literature Review
confidence: 99%
“…Nevertheless, even when the rating reliability indices are high, and even when MFRM analyses are applied methodically and rigorously, there are no guarantees that the raters will interpret the same criteria similarly. In fact, empirical studies suggest that the interpretation of a rating scale is fundamentally impacted by rater experience, task types, surface elements, and rater intuition (Lumley, 2002; Barkaoui, 2010; Fulcher et al., 2011; Isaacs & Thomson, 2013), which, in turn, raises important issues regarding scoring validity (Harsch & Martin, 2013; Deygers & Van Gorp, 2015).…”
Section: A Broader Conception of Fairness
confidence: 99%
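
The quoted caveat, that high reliability indices and rigorous MFRM analyses do not guarantee a shared interpretation of the criteria, can be illustrated with a hypothetical example. In the sketch below (invented data), two raters produce perfectly correlated scores, so a correlation-based reliability index is maximal, yet one rater consistently interprets the scale one band more severely. Modelling exactly this kind of rater-severity effect is what many-facet Rasch measurement (MFRM) is used for; the sketch only exposes the symptom and does not implement an MFRM model.

```python
# Hypothetical data: rater B applies the same rank ordering as rater A
# but interprets the scale one band more severely.
import numpy as np

rater_a = np.array([5, 4, 4, 3, 3, 2, 5, 4, 3, 2])
rater_b = rater_a - 1  # identical ordering, systematically harsher

r = np.corrcoef(rater_a, rater_b)[0, 1]
severity_gap = rater_a.mean() - rater_b.mean()

print(f"correlation = {r:.2f}")                         # 1.00: a 'perfect' reliability index
print(f"mean severity gap = {severity_gap:.1f} bands")  # yet every score differs by a full band
```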