Equating test forms is an essential activity in standardized testing, and its importance has grown under accountability systems that mandate Adequate Yearly Progress. It is through equating that scores from different test forms become comparable, which allows changes in student performance to be tracked from one year to the next. This study compares three item response theory scaling methods (fixed common item parameter, Stocking & Lord, and concurrent calibration) with respect to the classification of examinees into performance categories and the estimation of the ability parameter when the content of the test form changes slightly from year to year and the examinee ability distribution changes. The results indicate that calibration methods, especially concurrent calibration, produced more stable results than the transformation method.

Equating test forms is an essential activity in standardized testing, and its importance has grown under accountability systems that mandate adequate yearly progress (AYP). It is through equating that scores from different test forms become comparable, which allows changes in student performance to be tracked from one year to the next. Because there are different methods for equating, the choice of method can lead to different results and to different inferences about the achievement of the students in question. Choosing the most appropriate equating method for the particular situation is therefore essential. While no single equating method may be best in all situations, studies that examine the context of interest can help inform the choice of method.

In item response theory (IRT), there are primarily two methods of equating: observed score equating and true score equating.
Although only two equating methods are in common use, the heart of equating in IRT is a scaling step in which the parameter estimates from different calibrations are placed on a common scale, or metric. Several scaling techniques are commonly used in IRT, and the research comparing them is not conclusive. The best method typically depends on the context of the equating. In particular, the degree of equivalence between the groups used in scaling and equating appears to be a factor that differentiates scaling methods. In the case of equivalent groups, very little difference in results has been observed, while with non-equivalent groups, differences have