There has been an extensive debate within the human resources industry over the merits of oral tests versus assembled, multiple-choice written tests. Written tests, claim their proponents, are more objective, more reliable and better able to rank people on their competencies. For larger candidate populations, written tests are also much less expensive to administer.

Oral test proponents contend that, whatever their shortcomings, one can learn a great deal about candidates' communication skills, interpersonal skills and ability to discuss a topic in depth during face-to-face discussions. For smaller candidate populations, oral tests are much closer in cost to written tests, and they may even be somewhat less expensive to administer and score.

This article examines the inter-rater reliability of thirteen separate oral tests administered in 1996 by the New York State Department of Civil Service using statistical-based rating sheets (SBRS) as the measurement instrument. The article first seeks to determine whether oral test inter-rater reliability can be effectively measured with little interruption to the oral test process, and next examines the degree to which oral tests have inter-rater reliability, and hence a claim to objectivity. It then looks at the effects of examiner collaboration on the examiners' initial judgments.

BACKGROUND

Inter-Rater Reliability in Oral Tests

Reliability is a term used in testing to describe the consistency of test results. While a test may produce very favorable results once, what is the probability that these results will be repeated the next time the test is held? There are three types of reliability: inter-rater reliability (To what extent are the raters in agreement on the candidates' test performance?), the internal reliability of the measurement instrument (How effective is the measurement instrument in capturing useful information about different candidates' strengths and weaknesses?), and the reliability of the test materials (How likely are two candidates to produce the same results if they have essentially the same attributes? How likely is the same candidate to produce the same test results during a subsequent holding?). This article focuses on inter-rater reliability, but also provides statistical data on the internal reliability of the measurement instrument. Unless significantly flawed, a measurement instrument with more items will generally produce a higher internal reliability coefficient than one with fewer items.

Kane (1992) discusses three levels of assessment: multiple-choice items, simulations (e.g., oral tests) and actual practice (work) situations. He viewed the merits of each from three inferences. The first inference is evaluation of the examination results. Are there correct answers, or will different raters disagree on a candidate's performance? The second inference is generalization of the examination results. Since the examination represents just a sampl...