The performance of two polytomous item response theory models was compared with that of the dichotomous three-parameter logistic model in the context of equating tests composed of testlets. For the polytomous models, testlet scores were used to eliminate the effect of local dependence among items within a testlet. Traditional equating methods served as the criteria against which both model-based approaches were evaluated. The equating methods based on polytomous models produced results that agreed more closely with those of the traditional methods.
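To make the polytomous treatment concrete (a hedged sketch; the specific model is assumed for illustration, not quoted from the study): a testlet's number-correct score can be calibrated as a single polytomous item under, for example, Samejima's graded response model,

P(X_t = k | θ) = P*_{tk}(θ) − P*_{t,k+1}(θ), where P*_{tk}(θ) = 1 / (1 + exp[−D a_t (θ − b_{tk})]),

with the conventions P*_{t0}(θ) = 1 and P*_{t,m_t+1}(θ) = 0 for a testlet scored 0, ..., m_t. Because the testlet is modeled as one scoring unit, the local dependence among its items is absorbed into the testlet score distribution rather than violating the conditional independence assumption.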
How should we think about the concept of the testlet? How can testlets be better incorporated into test score analysis? Can there be a one‐item testlet?
The purpose of this study was to investigate methods of estimating the reliability of school-level scores using generalizability theory and multilevel models. Two approaches, 'students within schools' and 'students within schools and subject areas,' were conceptualized and implemented. Four methods, formed by crossing these two approaches with generalizability theory and multilevel models, were compared for both balanced and unbalanced data. For the 'students within schools' approach, the generalizability theory and multilevel models produced the same variance components and reliability estimates for the balanced data but not for the unbalanced data. The discrepancy arises because the two models employ different procedures for estimating the variance components that are used, in turn, to estimate reliability. Among the estimation methods investigated, the generalizability theory model with the 'students nested within schools crossed with subject areas' design produced the lowest reliability estimates. The choice between fully nested designs such as (students:schools) and (subject areas:students:schools) had no significant impact: both provided very similar reliability estimates of school-level scores.
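For concreteness (a minimal sketch of the balanced 'students within schools' case; the notation is illustrative rather than taken from the study): with school variance component σ²_s, student-within-school component σ²_{p:s}, and n students sampled per school, the generalizability coefficient for school-level mean scores is

Eρ² = σ²_s / (σ²_s + σ²_{p:s} / n).

Both frameworks recover the same two components from balanced data, so the coefficient agrees; with unbalanced data, the ANOVA-type estimators used in generalizability theory and the likelihood-based estimators typical of multilevel models can yield different component estimates and hence different reliabilities.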
The bookmark standard-setting procedure is an item response theory–based method that is widely implemented in state testing programs. This study estimates standard errors for cut scores resulting from bookmark standard settings under a generalizability theory model and investigates the effects of different universes of generalization and error sources on standard errors. This study produced several notable results. First, different patterns of variance component estimates were found for different cut scores; therefore, researchers should estimate separate variance components for each cut score and use them to estimate the corresponding standard errors. Second, different universes of generalization produced different standard error estimates; thus, policy makers should consider which universe is appropriate for the proposed use of the cut scores. Third, participants and groups had nonnegligible effects on several error sources. To decrease the standard errors of cut scores, increasing the number of small groups appears more efficient than increasing the number of participants.
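As a hedged illustration of the group-versus-participant tradeoff (assuming a participants-nested-within-groups design with n_g groups and n_p participants per group; this design is assumed for illustration): the standard error of the mean cut score is

SE = sqrt( σ²_g / n_g + σ²_{p:g} / (n_g n_p) ),

where σ²_g is the group variance component and σ²_{p:g} the participant-within-group component. Increasing n_g shrinks both terms, whereas increasing n_p shrinks only the second, which is consistent with the finding that adding small groups reduces standard errors more efficiently than adding participants.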
The primary purpose of this study was to investigate the appropriateness and implications of incorporating a testlet definition into estimation procedures for the conditional standard error of measurement (SEM) for tests composed of testlets. Another purpose was to investigate the bias in conditional SEM estimates when item-based methods are used instead of testlet-based methods. Several item-based and testlet-based estimation methods were proposed and compared. In general, item-based estimation methods underestimated the conditional SEM for tests composed of testlets, and the magnitude of this negative bias increased as the degree of conditional dependence among items within testlets increased. However, an item-based method using a generalizability theory model provided good estimates of the conditional SEM under mild violations of the measurement model assumptions. Under moderate or somewhat severe violations, testlet-based methods with item response models provided good estimates.
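To make the source of the bias concrete (a sketch under standard IRT assumptions; the notation is illustrative): conditioning on proficiency θ, an item-based method for dichotomous items estimates the conditional error variance of the raw score X as

Var(X | θ) = Σ_i P_i(θ) [1 − P_i(θ)],

which presumes local independence and so omits the positive within-testlet covariances, producing the negative bias described above. A testlet-based method instead sums polytomous score variances over testlets,

Var(X | θ) = Σ_t Σ_k [k − E(X_t | θ)]² P(X_t = k | θ),

so the within-testlet dependence is absorbed into each testlet's conditional score distribution; the conditional SEM is the square root of either quantity.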