In recent years there has been an increasing emphasis on assessment results, as well as increasing concern about the nature of the most widely used forms of student assessment and the uses made of the results. These conflicting forces have helped create a burgeoning interest in alternative forms of assessment, particularly complex, performance-based assessments. It is argued that there is a need to rethink the criteria by which the quality of educational assessments is judged, and a set of criteria sensitive to some of the expectations for performance-based assessments is proposed.
The purpose of this study was to investigate the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model. A multiple-replication Monte Carlo study was conducted in which DIF was modeled in simulated data sets that were then calibrated with MULTILOG (Thissen, 1991) using hierarchically nested item response models. For comparative purposes, the power and Type I error rate of the Mantel (1963) approach for detecting DIF in ordered response categories were investigated using the same simulated data. The power of both the Mantel and LR procedures was affected by sample size, as expected. The LR procedure lacked the power to consistently detect DIF when it existed in reference/focal groups with sample sizes as small as 500/500. The Mantel procedure maintained control of its Type I error rate and was more powerful than the LR procedure when the comparison group ability distributions were identical and there was a constant DIF pattern. On the other hand, the Mantel procedure lost control of its Type I error rate, whereas the LR procedure did not, when the comparison groups differed in mean ability; and the LR procedure demonstrated a profound power advantage over the Mantel procedure under conditions of balanced DIF in which the comparison group ability distributions were identical. The choice and subsequent use of either procedure requires a thorough understanding of its power and Type I error rate under varying conditions of DIF pattern, comparison group ability distributions (or, as a surrogate, observed score distributions), and item characteristics.
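As a minimal sketch of the kind of data generation this design implies, the snippet below simulates graded responses under Samejima's model for a reference group and for a focal group whose thresholds are uniformly shifted (a constant-DIF pattern). The item parameters, sample sizes, and threshold shift are illustrative assumptions, not the study's actual simulation design, and the calibration step itself (performed with MULTILOG in the study) is omitted.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities for one item under Samejima's graded response model.

    theta : scalar ability
    a     : item discrimination
    b     : ordered category thresholds (length K-1 for K categories)
    """
    # Boundary probabilities P(X >= k | theta) for k = 1..K-1
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))
    # Bracket with P(X >= 0) = 1 and P(X >= K) = 0, then take differences
    bounds = np.concatenate(([1.0], p_star, [0.0]))
    return bounds[:-1] - bounds[1:]

rng = np.random.default_rng(0)

# Hypothetical four-category item: reference-group parameters, plus a
# focal-group version with a uniform threshold shift (constant DIF)
a_ref = 1.2
b_ref = np.array([-1.0, 0.0, 1.0])
b_foc = b_ref + 0.5

def simulate(n, a, b, mean_ability=0.0):
    """Draw n abilities from N(mean_ability, 1) and sample graded responses."""
    thetas = rng.normal(mean_ability, 1.0, n)
    return np.array([rng.choice(len(b) + 1, p=grm_category_probs(t, a, b))
                     for t in thetas])

ref_responses = simulate(500, a_ref, b_ref)   # reference group, N = 500
foc_responses = simulate(500, a_ref, b_foc)   # focal group with DIF, N = 500
```

Shifting the focal group's mean ability in `simulate` (e.g., `mean_ability=-0.5`) reproduces the unequal-ability-distribution condition under which the abstract reports the Mantel procedure losing control of its Type I error rate.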