This study examined the effectiveness of the three-parameter IRT model in vertically equating five overlapping levels of a mathematics computation test. One to four test levels were administered within intact classrooms to randomly equivalent groups of third- through eighth-grade students. Test characteristic curves were derived for each grade/test level combination. It was generally found that an examinee would receive a higher ability estimate if the test level administered had been calibrated on less able examinees. Practical implications for "out-of-level" and adaptive testing are discussed.

It is often considered desirable to test a student in a given subject matter area periodically throughout his/her formal schooling, and to compare the scores obtained across the various testings. Because knowledge in many subject areas is closely linked to school curricula, standardized achievement tests are usually developed in levels that attempt to mirror "typical" curriculum placement of different aspects of a subject area. This usually results in a standardized test battery with levels corresponding, at least roughly, to grades in school. In order to compare test scores across these levels, a scale must be developed that allows comparisons of raw scores obtained on tests differing in content and difficulty.
This is the problem that vertical equating attempts to solve: how to develop a score scale across test levels which (1) differ in difficulty and (2) are designed for groups of examinees who differ in average ability level. This study was designed to examine the effectiveness of the three-parameter item response theory (IRT) model in vertically equating the mathematics computation test of the Iowa Tests of Basic Skills (Hieronymus, Lindquist, & Hoover, 1977).

IRT methods are frequently suggested as the preferred vertical equating approach for two reasons: (1) it is recognized that problems exist with the classical test theory methods (see, e.g., Lord, 197?; Lord & Wingersky, 1984), and (2) IRT methods are usually conceived of as having "person-free" calibration and "item-free" measurement. These properties imply that the item parameters which are estimated are invariant for all subgroups of examinees, and that, once the items are calibrated, the same θ estimate would be obtained (except for errors of measurement) for an individual regardless of the subset of items he/she was administered. These properties, if they held, would essentially solve the problem of vertical equating.

The two IRT models that have been most prominent in the vertical equating literature are the one-parameter (Rasch) model and the three-parameter model (see, e.g., Hambleton & Swaminathan, 1984). Although the Rasch model possesses certain desirable properties, such as simplicity and a monotonic relationship between raw score and estimated examinee ability, there are indications that the model does not perform well in practice in vertical equating.
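For readers unfamiliar with the models being compared, the standard form of the three-parameter logistic (3PL) model gives the probability of a correct response to item $i$ as a function of examinee ability $\theta$. The formula below is the conventional parameterization from the IRT literature, not one quoted from this article:

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i(\theta - b_i)}}
```

Here $a_i$ is the item discrimination, $b_i$ the item difficulty, $c_i$ the lower asymptote ("pseudo-guessing") parameter, and $D \approx 1.7$ a scaling constant. Constraining all $a_i$ to a common value and setting $c_i = 0$ reduces this to the one-parameter (Rasch) model discussed above, which is what yields its monotonic relationship between raw score and estimated ability.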