There has been an extensive debate within the human resources industry over the merits of oral tests versus assembled, multiple-choice written tests. Written tests, claim their proponents, are more objective, more reliable and better able to rank people on their competencies. For larger candidate populations, written tests are also much less expensive to administer.

Oral test proponents contend that, whatever their shortcomings, one can learn a great deal about candidates' communication skills, interpersonal skills and ability to discuss a topic in depth during face-to-face discussions. For smaller candidate populations, oral tests are much closer in cost to written tests, and they may even be somewhat less expensive to administer and score.

This article examines the inter-rater reliability of thirteen separate oral tests administered in 1996 by the New York State Department of Civil Service using statistical-based rating sheets (SBRS) as the measurement instrument. The article first seeks to determine whether oral test inter-rater reliability can be effectively measured with little interruption to the oral test process, and next examines the degree to which oral tests have inter-rater reliability, and hence a claim to objectivity. It then looks at the effects of examiner collaboration on the examiners' initial judgments.

BACKGROUND

Inter-Rater Reliability in Oral Tests

Reliability is a term used in testing to describe the consistency of test results. While a test may produce very favorable results once, what is the probability that these results will be repeated the next time the test is held? There are three types of reliability: inter-rater reliability (To what extent are the raters in agreement on the candidates' test performance?), the internal reliability of the measurement instrument (How effective is the measurement instrument in capturing useful information about different candidates' strengths and weaknesses?), and the reliability of the test materials (How likely are two candidates to produce the same results if they have essentially the same attributes? How likely is the same candidate to produce the same test results during a subsequent holding?). This article focuses on inter-rater reliability, but also provides statistical data on the internal reliability of the measurement instrument. Unless significantly flawed, a measurement instrument with more items will generally produce a higher internal reliability coefficient than one with fewer items.

Kane (1992) discusses three levels of assessment: multiple-choice items, simulations (e.g., oral tests) and actual practice (work) situations. He viewed the merits of each from three inferences. The first inference is evaluation of the examination results. Are there correct answers, or will different raters disagree on a candidate's performance? The second inference is generalization of the examination results. Since the examination represents just a sampl...