Two methods of 'equating' tests are compared, one using true scores, the other using equipercentile equating of observed scores. The theory of equating is discussed. For the data studied, the two methods yield almost indistinguishable results.

Most item response theory (IRT) equating is currently carried out by the true-score equating procedure described in Lord (1980, chap. 13). Lord also described an IRT equipercentile observed-score procedure, which until now seems to have been little used in operational work, perhaps because it is more complicated and more expensive than the true-score procedure. This paper discusses theoretical considerations and reports an empirical study comparing the results of applying these two procedures to real test data. Note that IRT plays only a subsidiary role in observed-score equipercentile equating; similar results could be expected for conventional equipercentile equating, assuming that the IRT model holds.

Kolen (1981) found the equipercentile observed-score IRT procedure to be one of the better of nine procedures compared in his empirical study. However, his criterion was stability in cross-validation. Although stability is certainly desirable, it is not a proper criterion for choosing the best equating method: incorrect equating procedures may yield more stable results than correct ones.

Sections 1 and 2 outline the true-score procedure and the observed-score equipercentile procedure, respectively. Section 3 discusses the theoretical advantages and disadvantages of each procedure. Section 4 describes the real test data used to compare the two methods. Section 5 describes the procedures used for estimating item and ability parameters. Section 6 reports and summarizes the empirical results.

IRT models the probability of a correct response by an examinee to a test item as a monotonically increasing function of ability. The model used here is the three-parameter logistic model,

$$P_i(\theta_a) = c_i + \frac{1 - c_i}{1 + \exp[-1.7\, a_i(\theta_a - b_i)]},$$

where $P_i(\theta_a)$ is the probability of examinee $a$ answering item $i$ correctly, $b_i$ is the difficulty of item $i$, $a_i$ is the discrimination index for item $i$, and $c_i$ is the lower-asymptote ("guessing") parameter for item $i$.
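To make these two ingredients concrete, here is a minimal Python sketch of the 3PL response function and of Lord-style true-score equating: find the ability at which the form-X true score equals a given score, then evaluate the form-Y true score at that ability. The item parameters, the score value, and the helper names (`p3pl`, `true_score`, `equate_true_score`) are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the 3PL model and Lord-style true-score equating.
# All item parameters below are hypothetical.
import numpy as np
from scipy.optimize import brentq

D = 1.7  # standard logistic scaling constant

def p3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def true_score(theta, a, b, c):
    """Expected number-correct (true) score on a test at ability theta."""
    return np.sum(p3pl(theta, a, b, c))

# Hypothetical item parameters for forms X and Y.
ax, bx, cx = np.array([1.0, 1.2, 0.8]), np.array([-0.5, 0.0, 0.7]), np.array([0.2, 0.2, 0.2])
ay, by, cy = np.array([0.9, 1.1, 1.3]), np.array([-0.2, 0.3, 0.9]), np.array([0.2, 0.2, 0.2])

def equate_true_score(x_score):
    """Map a true score on form X to the equivalent true score on form Y.

    Step 1: solve for the ability whose form-X true score equals x_score.
    Step 2: return the form-Y true score at that same ability.
    """
    theta = brentq(lambda t: true_score(t, ax, bx, cx) - x_score, -8.0, 8.0)
    return true_score(theta, ay, by, cy)

print(equate_true_score(2.0))
```

Note that true scores at or below the sum of the guessing parameters $\sum_i c_i$ correspond to no finite ability under the 3PL model, a well-known limitation of true-score equating.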
Standard procedures for equating tests, including those based on item response theory (IRT), require item responses from large numbers of examinees. Such data may not be forthcoming for reasons theoretical, political, or practical. Information about items' operating characteristics may be available from other sources, however, such as content and format specifications, expert opinion, or psychological theories about the skills and strategies required to solve them. This paper shows how, in the IRT framework, collateral information about items can be exploited to augment or even replace examinee responses when linking or equating new tests to established scales. The procedures are illustrated with data from the Pre‐Professional Skills Test (PPST).
Simulated data were used to investigate the performance of modified versions of the Mantel-Haenszel method of differential item functioning (DIF) analysis in computerized adaptive tests (CATs). Each simulated examinee received 25 items from a 75-item pool. A three-parameter logistic item response theory (IRT) model was assumed, and examinees were matched on expected true scores based on their CAT responses and estimated item parameters. The CAT-based DIF statistics were found to be highly correlated with DIF statistics based on nonadaptive administration of all 75 pool items and with the true magnitudes of DIF in the simulation. Average DIF statistics and average standard errors were also examined for items with various characteristics. Finally, a study was conducted of the accuracy with which the modified Mantel-Haenszel procedure could identify CAT items with substantial DIF, using a classification system now implemented by some testing programs. These additional analyses provided further evidence that the CAT-based DIF procedures performed well. More generally, the results supported the use of IRT-based matching variables in DIF analysis. Index terms: adaptive testing, computerized adaptive testing, differential item functioning, item bias, item response theory.

Many large-scale testing programs are now piloting or implementing computerized adaptive tests (CATs).
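As an illustration of the DIF index involved, the sketch below computes the Mantel-Haenszel common odds ratio and the ETS-style MH D-DIF statistic ($-2.35 \ln \hat\alpha_{MH}$) from a 2x2 table at each matched stratum. In the CAT adaptation studied here, the strata would be intervals of IRT expected true score rather than raw number-correct score; all counts below are hypothetical.

```python
# Sketch of the Mantel-Haenszel common odds ratio and MH D-DIF index.
import math

def mh_d_dif(strata):
    """strata: list of 2x2 tables (A, B, C, D) per matched score level,
    where A/B = reference-group correct/incorrect counts and
          C/D = focal-group correct/incorrect counts."""
    num = den = 0.0
    for A, B, C, D in strata:
        n = A + B + C + D
        if n == 0:
            continue
        num += A * D / n
        den += B * C / n
    alpha_mh = num / den               # common odds ratio across strata
    return -2.35 * math.log(alpha_mh)  # delta-scale DIF index; negative
                                       # values indicate DIF against the
                                       # focal group

# Hypothetical counts at three matched ability strata.
tables = [(40, 10, 30, 20), (50, 20, 40, 30), (30, 30, 20, 40)]
print(mh_d_dif(tables))  # about -1.74 for these counts
```

Under the widely used ETS A/B/C scheme, presumably the "classification system now implemented by some testing programs" alluded to above, items with |MH D-DIF| of at least 1.5 that differs significantly from 1 are flagged as category C.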
The sampling errors of maximum likelihood estimates of item response theory parameters are studied in the case when both person and item parameters are estimated simultaneously. A check on the validity of the standard error formulas is carried out. The effect of varying sample size, test length, and the shape of the ability distribution is investigated. Finally, the effect of anchor-test length on the standard error of item parameters is studied numerically for the situation, common in equating studies, when two groups of examinees each take a different test form together with the same anchor test. The results encourage the use of rectangular or bimodal ability distributions, and also the use of very short anchor tests.

Until recently, the asymptotic sampling variances and covariances for maximum likelihood estimates of item parameters in item response theory (IRT) have usually been computed by assuming abilities to be known. Conversely, the asymptotic sampling variances and covariances for ability estimates have been computed by assuming the item parameters to be known. In this paper, a suggested method for computing the asymptotic sampling variance-covariance matrix of joint maximum likelihood estimates when all parameters are unknown (Lord & Wingersky, in press) is used to try to answer various practical questions. (For many purposes, an alternative approach has recently become available: marginal maximum likelihood estimation, exemplified by BILOG [Mislevy & Bock, 1981], which provides asymptotic sampling variances for the estimates obtained. This approach was not available to the authors at the time the investigation reported here was initiated, and it is not discussed here.) Throughout this paper all sampling variances, covariances, and standard errors are asymptotic.

Section 2 presents additional, though not conclusive, evidence that the Lord-Wingersky method for computing the sampling variance-covariance matrix yields correct results. Section 3 investigates the effect of changing the number of items, the number of people, or the distribution of ability on the standard errors of both the item parameters and the abilities. Section 4 presents a technique for displaying and understanding the standard errors and sampling covariances of estimates of item parameters. Section 5 deals with the situation in which two tests contain a set of items in common and are administered to two separate groups of examinees. An important problem in item banking or test equating is to put the parameter estimates for the two tests on a common scale. One way to do this is to estimate all of the parameters for both tests in one calibration run; when this is done, the estimates for both tests are automatically expressed on a common scale.
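The abstract contrasts joint estimation with the classical shortcut of computing item-parameter standard errors as if abilities were known. Below is a minimal sketch of that classical case, assuming the standard 3PL Fisher-information formulas (not reproduced in the abstract): each examinee's contribution to the 3x3 information matrix for one item's (a, b, c) is summed, the matrix is inverted, and standard errors are read off the diagonal. The abilities, item parameters, and sample size are hypothetical.

```python
# Sketch of asymptotic SEs for one 3PL item's parameters, abilities known.
import numpy as np

D = 1.7  # standard logistic scaling constant

def item_info_matrix(thetas, a, b, c):
    """3x3 Fisher information for (a, b, c) of one 3PL item,
    given known abilities."""
    L = 1.0 / (1.0 + np.exp(-D * a * (thetas - b)))  # logistic part
    P = c + (1.0 - c) * L                            # 3PL probability
    dPda = (1.0 - c) * D * (thetas - b) * L * (1.0 - L)
    dPdb = -(1.0 - c) * D * a * L * (1.0 - L)
    dPdc = 1.0 - L
    grads = np.stack([dPda, dPdb, dPdc])             # 3 x N gradient rows
    w = 1.0 / (P * (1.0 - P))                        # Bernoulli weight
    return (grads * w) @ grads.T

thetas = np.random.default_rng(0).normal(size=2000)  # N(0,1) abilities
info = item_info_matrix(thetas, a=1.0, b=0.0, c=0.2)
ses = np.sqrt(np.diag(np.linalg.inv(info)))          # asymptotic SEs
print(dict(zip(["se_a", "se_b", "se_c"], ses.round(3))))
```

Swapping the normal draw for a rectangular one, e.g. `rng.uniform(-3, 3, size=2000)`, is an easy way to explore the paper's finding that spread-out ability distributions reduce item-parameter standard errors.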