2016
DOI: 10.1080/10627197.2016.1236676
Exploring the Effects of Rater Linking Designs and Rater Fit on Achievement Estimates Within the Context of Music Performance Assessments

Cited by 25 publications (29 citation statements)
References 28 publications

“…For example, methods based on latent trait models such as the many-facet Rasch (MFR) model (Linacre) provide estimates of examinee achievement that are adjusted for systematic differences in rater severity, so long as there are sufficient connections between raters. However, these adjustments require acceptable fit to the MFR model (Wind, Engelhard, & Wesolowski). As a result of these statistical adjustments, many researchers and practitioners have used methods based on the MFR model to estimate examinee achievement in rater-mediated performance assessments (Engelhard & Wind).…”
Section: Literature Review
Citation type: mentioning (confidence: 99%)
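For context, the many-facet Rasch model referenced in this excerpt is commonly written in the following rating-scale form (a standard formulation following Linacre's notation; the cited works may parameterize it differently):

$$\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \lambda_j - \delta_i - \tau_k$$

Here $P_{nijk}$ is the probability that rater $j$ assigns examinee $n$ a rating in category $k$ on item $i$, $\theta_n$ is examinee achievement, $\lambda_j$ is rater severity, $\delta_i$ is item difficulty, and $\tau_k$ is the threshold between categories $k-1$ and $k$. Because $\lambda_j$ enters the model as its own facet, systematic severity differences are absorbed there rather than in the achievement estimates, provided the design links raters and the data fit the model.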
“…The figure includes three examples of popular incomplete rating designs in which there are sufficient links among raters and examinees to facilitate MFR model estimation procedures. These designs appear in both methodological research (e.g., Eckes; Engelhard; Engelhard & Wind; Hombo et al.; Schumacker) and applied research (e.g., Johnson, Penny, & Gordon; Wesolowski et al.; Wind et al.). In each design, each of the raters scores examinees in common with at least one other rater, thus facilitating the adjustment procedure.…”
Section: Literature Review
Citation type: mentioning (confidence: 99%)
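To make the linkage idea concrete, here is a minimal Python sketch of one common linked incomplete design, in which overlapping blocks of examinees chain consecutive raters together. This is an illustration only, not code or a design taken from the cited studies; the function name and parameter values are hypothetical.

```python
def linked_design(n_examinees: int, n_raters: int, block_size: int, overlap: int):
    """Assign each rater a block of examinees; consecutive blocks share
    `overlap` examinees, chaining all raters into one connected network."""
    step = block_size - overlap
    design = {}
    for j in range(n_raters):
        start = (j * step) % n_examinees
        design[j] = [(start + e) % n_examinees for e in range(block_size)]
    return design

design = linked_design(n_examinees=100, n_raters=20, block_size=10, overlap=5)
# Rater 0 scores examinees 0-9, rater 1 scores 5-14, and so on: every
# adjacent pair of raters shares five examinees, so each rater scores
# examinees in common with at least one other rater.
```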
“…Additionally, in some data collection procedures, it may not be feasible for all raters to evaluate all persons used in the study. In rater-mediated assessments, the connectivity of raters can affect the empirical results of the assessment context (Wind, Engelhard, & Wesolowski). Therefore, it is an important research design consideration that warrants researcher specification.…”
Section: Model 1: Observation Model
Citation type: mentioning (confidence: 99%)
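Connectivity in this sense can be checked directly: treating raters and persons as nodes of a bipartite graph with an edge wherever a rater scores a person, severity-adjusted comparisons across all raters are possible only when the graph forms one connected component. The following Python sketch performs that check with a breadth-first search; it is a hypothetical helper, not code from the cited works.

```python
from collections import defaultdict, deque

def is_connected(design: dict) -> bool:
    """Return True if the rater-by-person design forms one connected
    bipartite network (raters and persons as nodes, ratings as edges)."""
    graph = defaultdict(set)
    for rater, persons in design.items():
        for p in persons:
            graph[("r", rater)].add(("p", p))
            graph[("p", p)].add(("r", rater))
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node] - seen:
            seen.add(nbr)
            queue.append(nbr)
    return len(seen) == len(graph)

# Two raters with disjoint person sets form a disconnected design:
assert not is_connected({0: [1, 2], 1: [3, 4]})
assert is_connected({0: [1, 2], 1: [2, 3]})
```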
“…A data set of example ratings in which 20 raters rated 100 persons on three domains using a five-category rating scale (0, 1, 2, 3, 4) was simulated to illustrate the interpretation of each component of the equation, where lower numbers indicate lower judged scores of the person and higher numbers indicate higher judged scores of the person. Complete assessment networks, where all raters rate all persons, are theoretically ideal and desirable; however, most large-scale operational rater-mediated assessment systems involve various forms of incomplete assessment networks due to time, money, and other administrative constraints (Wind, Engelhard, & Wesolowski). As Engelhard notes, incomplete assessment network designs, when constructed using sound data collection designs, “obtain reliable and valid links both within and between facets that are less costly in terms of examinee time and rater salaries” (p. 27).…”
Section: Model 2: Measurement Model
Citation type: mentioning (confidence: 99%)
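The simulation setup described in this excerpt can be reproduced in outline. The Python sketch below generates ratings for 20 raters, 100 persons, and three domains on a five-category scale under a rating-scale formulation of the MFR model; all generating parameter values are illustrative assumptions, not those used in the cited study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_raters, n_domains, n_cats = 100, 20, 3, 5

# Illustrative generating parameters (assumptions, not the cited study's values).
theta = rng.normal(0.0, 1.0, n_persons)     # person achievement
severity = rng.normal(0.0, 0.5, n_raters)   # rater severity
difficulty = np.array([-0.5, 0.0, 0.5])     # domain difficulty
tau = np.array([-1.5, -0.5, 0.5, 1.5])      # category thresholds (k = 1..4)

def rating_probs(t, lam, d):
    """Category probabilities for one person-rater-domain combination
    under the rating-scale MFR model."""
    # Cumulative sums of (theta - severity - difficulty - tau_k) give the
    # unnormalized log-probability of each category 1..4; category 0 is 0.
    steps = t - lam - d - tau
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Complete network: every rater rates every person on every domain.
ratings = np.empty((n_persons, n_raters, n_domains), dtype=int)
for n in range(n_persons):
    for j in range(n_raters):
        for i in range(n_domains):
            p = rating_probs(theta[n], severity[j], difficulty[i])
            ratings[n, j, i] = rng.choice(n_cats, p=p)
```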
“…Connectivity within teacher evaluation systems is critical because it allows researchers and practitioners to compare teacher performance and principal severity in situations where it is not possible for every principal to rate every teacher. These comparisons cannot be made without systematic connections among different principals and teachers, because it is impossible to separate teachers’ ratings from principals’ severity (Engelhard; Linacre; Lunz & Linacre; Schumacker; Wind, Engelhard, & Wesolowski). In other words, analyzing teacher evaluation data without taking into account differences between principals can introduce potential bias into the ratings.…”
Citation type: mentioning (confidence: 99%)
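The inseparability claim in this excerpt can be stated precisely with a standard identification argument (a sketch in the MFR notation introduced above, not taken from the cited works). The model depends on teachers and principals only through the difference $\theta_n - \lambda_j$. If the design splits into disconnected groups, then within any group $c$ the substitution

$$\theta_n \mapsto \theta_n + a_c, \qquad \lambda_j \mapsto \lambda_j + a_c$$

leaves every modeled rating probability unchanged, and the constants $a_c$ can differ arbitrarily across groups. Cross-group comparisons of teacher measures (or principal severity) are therefore undetermined without systematic links: a teacher who appears stronger may simply have been rated by a more lenient principal.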