Automatic item generation (AIG)—a means of leveraging technology to create large quantities of items—requires a minimum number of items to offset the sizable upfront investment (i.e., model development and technology deployment) in order to achieve cost savings. In this cost–benefit analysis, we estimated the cost of each step of AIG and manual item writing and applied cost–benefit formulas to calculate the number of items that would have to be produced before AIG's per‐item savings offset its upfront costs relative to manual item writing, in the context of K‐12 mathematics items. Results indicated that AIG is more cost‐effective than manual item writing when developing, at a minimum, 173 to 247 items within one fine‐grained content area (e.g., fourth‐ through seventh‐grade area of figures). The article concludes with a discussion of implications for test developers and the nonmonetary tradeoffs involved in AIG.
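As a rough sketch of the break‐even logic described above (the symbols here are ours, not the article's): let \(F\) be the upfront AIG investment, \(c_a\) the per‐item cost under AIG, and \(c_m\) the per‐item cost of manual writing. AIG becomes cost‐effective once

\[
F + c_a n \le c_m n \quad\Longrightarrow\quad n \ge n^{*} = \frac{F}{c_m - c_a},
\]

so, on this reading, the reported minimum of 173 to 247 items presumably reflects the estimated break‐even count \(n^{*}\) under varying assumptions about these costs.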
Alignment is an essential piece of validity evidence for both educational (K‐12) and credentialing (licensure and certification) assessments. In this article, a comprehensive review of commonly used contemporary alignment procedures is provided; some key weaknesses in current alignment approaches are identified; principles for evaluating alignment methods are distilled; and a new approach to investigating alignment is proposed and illustrated. The article concludes with suggestions for alignment research and practice.
We explored the feasibility of using automated scoring to assess upper‐elementary students’ reading ability through analysis of transcripts of students’ small‐group discussions about texts. Participants included 35 fourth‐grade students in two classrooms that participated in a literacy intervention called Quality Talk. Over the course of one school year, data were collected at 10 time points, for a total of 327 student‐text encounters, with a different text discussed at each time point. To explore the possibility of automated scoring, we examined which quantitative discourse variables (e.g., measures of language sophistication and latent semantic analysis variables) were the strongest predictors of scores on a multiple‐choice and constructed‐response reading comprehension test. Convergent validity evidence was collected by comparing automatically calculated quantitative discourse features to scores on a reading fluency test. After examining a variety of discourse features using multilevel modeling, results showed that measures of word rareness and word diversity were the most promising variables for automated scoring of students’ discussions.
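To make the final finding concrete, below is a minimal, hypothetical Python sketch of how transcript‐level word rareness and word diversity features might be computed. The operationalizations (inverse corpus frequency and type–token ratio) and the REFERENCE_FREQ table are illustrative assumptions, not the study's published measures.

import math
import re

# Hypothetical reference word frequencies (occurrences per million words);
# a real scoring system would use a large corpus-based frequency list.
REFERENCE_FREQ = {
    "the": 50000.0, "story": 120.0, "character": 80.0,
    "theme": 30.0, "protagonist": 4.0, "foreshadowing": 0.5,
}

def tokenize(transcript: str) -> list[str]:
    """Lowercase the transcript and keep alphabetic word tokens."""
    return re.findall(r"[a-z']+", transcript.lower())

def word_rareness(tokens: list[str], default_freq: float = 1.0) -> float:
    """Mean negative log relative frequency: rarer words score higher."""
    if not tokens:
        return 0.0
    logs = [-math.log(REFERENCE_FREQ.get(t, default_freq) / 1_000_000)
            for t in tokens]
    return sum(logs) / len(logs)

def word_diversity(tokens: list[str]) -> float:
    """Type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

discussion = "The protagonist develops the theme through foreshadowing in the story."
tokens = tokenize(discussion)
print(f"rareness={word_rareness(tokens):.2f} diversity={word_diversity(tokens):.2f}")

In the study, features of this kind served as predictors in multilevel models of reading comprehension scores; the sketch stops at feature extraction.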