Abstract: This mini review summarizes the current state of knowledge about automatic item generation in the context of educational assessment and discusses key points in the item generation pipeline. Assessment is critical in all learning systems, and digitalized assessments have grown significantly over the last decade. This creates an urgent need to generate more items quickly and efficiently. Continuous improvements in computational power and advancements in methodological approaches, specifically in the …
“…Using AI and AI-driven applications, tools, and techniques can reduce the challenges that traditional methods present in item construction and item stability [23]. Through AI, it becomes easier and more feasible to construct and update items (questions) and to build item banks.…”
Background
The introduction of competency-based education models and student-centered learning, together with the increased use of formative assessment, has led to demands for high-quality test items. This study aimed to assess the use of an AI tool to generate type A multiple-choice questions (MCQs) and to evaluate their quality.
Methods
The study was a cross-sectional analytical study conducted from June 2023 to August 2023, using a formative team-based learning (TBL) session. The AI tool ChatPdf.com was selected to generate type A MCQs. The generated items were evaluated using a questionnaire completed by subject experts and an item (psychometric) analysis. The expert questionnaire addressed item quality and a rating of item difficulty.
Results
The number of staff members recruited as experts was 25, and the questionnaire response rate was 68%. The quality of the items ranged from good to excellent. None of the items included scenarios or vignettes; all were direct questions. According to the experts' ratings, 80% of the items were easy and only two (20%) were of moderate difficulty; of these two, only one had a matching difficulty index. Forty-eight students participated in the TBL session. The mean mark was 4.8 ± 1.7 out of 10, and the KR-20 was 0.68. Most items were moderately difficult (90%) and only one was difficult (10%). The discrimination index ranged from 0.15 to 0.77: five items (50%) showed excellent discrimination, three (30%) showed good discrimination, one (10%) was poor, and one was non-discriminating. Functional distractors numbered 26 (86.7%) and non-functional distractors four (13.3%). In the distractor analysis, 60% of the items were rated excellent and 40% good. The correlation between the difficulty and discrimination indices was weak and not statistically significant (r = 0.30, p = 0.4).
Conclusion
Items constructed using AI had good psychometric properties and quality, measuring higher-order domains. AI allows the construction of many items within a short time. We hope this paper brings the use of AI in item generation and the associated challenges into a multi-layered discussion that will eventually lead to improvements in item generation and assessment in general.
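The Results above rest on classical item-analysis statistics: the difficulty index (proportion correct), the discrimination index (upper-lower group comparison), and KR-20 reliability. The following is a minimal sketch of how these indices are typically computed from a dichotomously scored response matrix; the data and variable names are illustrative assumptions, not the study's data.

```python
import numpy as np

# Minimal sketch: classical item analysis on a 0/1 scored response matrix
# (rows = students, columns = items). The response data are illustrative only.
responses = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
])

n_students, n_items = responses.shape
totals = responses.sum(axis=1)

# Difficulty index: proportion of students answering each item correctly.
difficulty = responses.mean(axis=0)

# Discrimination index: difference in proportion correct between the
# top and bottom 27% of students ranked by total score.
order = np.argsort(totals)
k = max(1, int(round(0.27 * n_students)))
low, high = responses[order[:k]], responses[order[-k:]]
discrimination = high.mean(axis=0) - low.mean(axis=0)

# KR-20 reliability for dichotomous items.
p = difficulty
q = 1 - p
var_total = totals.var(ddof=1)
kr20 = (n_items / (n_items - 1)) * (1 - (p * q).sum() / var_total)

print("difficulty:", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
print("KR-20:", round(kr20, 2))
```

With the study's 10 items and 48 students, the same computation would yield the difficulty, discrimination, and KR-20 values reported in the Results.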
“…Many items can be generated for a specific topic based on a single cognitive model (Gierl et al., 2012), and such models are standard in measurement theory, allowing the developed tests to serve the assessment purposes of validity, reliability, fairness, and quality. Moreover, AIG is known to make test and assessment development easier by making it quicker to create items, reducing the cost of item creation, helping to continuously and rapidly develop a large pool of items, and tailoring items to fit individual learning needs for better outcomes (Circi et al., 2023).…”
Section: AI Item Generation
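The snippet above describes model-based AIG: a single item model (template) with modifiable elements is instantiated many times to produce a pool of items. A minimal sketch of that idea is shown below; the template, value ranges, and distractor rules are illustrative assumptions, not taken from any cited system.

```python
import itertools
import random

# Minimal sketch of model-based automatic item generation: one item template
# ("item model") with modifiable elements is instantiated into many items.
# Template, value ranges, and distractor rules are illustrative assumptions.

TEMPLATE = ("A patient weighing {weight} kg requires a drug dosed at "
            "{dose} mg/kg. What total dose should be administered?")

def make_item(weight, dose):
    correct = weight * dose
    # Distractors derived from plausible calculation errors.
    distractors = {weight + dose, correct / 2, correct * 2}
    distractors.discard(correct)
    options = [correct] + sorted(distractors)[:3]
    random.shuffle(options)
    return {
        "stem": TEMPLATE.format(weight=weight, dose=dose),
        "options": [f"{o:g} mg" for o in options],
        "key": f"{correct:g} mg",
    }

# Generate an item pool from the cross product of the modifiable elements.
pool = [make_item(w, d) for w, d in itertools.product((50, 60, 70, 80), (2, 5, 10))]
print(len(pool), "items generated; example:", pool[0]["stem"])
```

A single cognitive model here yields twelve items; richer models add constraints on which element combinations are valid and on how distractors map onto known misconceptions.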
“…In educational settings, there has been a growing demand for the rapid generation of assessment items to accommodate continuous testing requirements (Kurdi et al., 2019). This shift has posed challenges to traditional test item creation methods and to the maintenance of test item bank stability (Circi et al., 2023). Finding high-quality test items has consistently proven difficult, with the manual creation of items being time-consuming and costly (Gehringer, 2004).…”
Section: Introduction
“…Many teachers encounter difficulties designing quality items for each assessment, often resorting to item reuse across terms (Gehringer, 2004; Wellberg, 2023). However, this practice may lead to issues such as students memorizing answers without engaging with the content and the risk of cheating through item over-exposure (Circi et al., 2023; Gehringer, 2004).…”
This study examines the efficacy of artificial intelligence (AI) in creating parallel test items compared to human-made ones. Two test forms were developed: one consisting of 20 existing human-made items and another with 20 new items generated with ChatGPT assistance. Expert reviews confirmed the content parallelism of the two test forms. Forty-three university students then completed the 40 test items, presented in random order from both forms, on a final test. Statistical analyses of student performance indicated comparability between the AI-assisted and the human-made test forms. Despite limitations such as sample size and reliance on classical test theory (CTT), the findings suggest ChatGPT’s potential to assist teachers in test item creation, reducing workload and saving time. These results highlight ChatGPT’s value in educational assessment and emphasize the need for further research and development in this area.
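Because each student in the abstract above answered both forms, a CTT-style comparison of the two forms reduces to a paired contrast of form scores plus an alternate-forms correlation. The following is a minimal sketch of such a comparison under those assumptions; the scores are simulated placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

# Minimal sketch of a CTT-style parallel-forms comparison. Each student has a
# score on the human-made form (A) and on the AI-assisted form (B). The scores
# below are simulated placeholders, not data from the cited study.
rng = np.random.default_rng(0)
form_a = np.clip(rng.normal(14, 3, size=43).round(), 0, 20)           # human-made form, 20 items
form_b = np.clip(form_a + rng.normal(0, 2, size=43).round(), 0, 20)   # AI-assisted form, 20 items

# Paired comparison: do mean scores differ between the two forms?
t, p = stats.ttest_rel(form_a, form_b)

# Alternate-forms consistency: correlation between the two scores.
r, _ = stats.pearsonr(form_a, form_b)

print(f"mean A={form_a.mean():.1f}, mean B={form_b.mean():.1f}")
print(f"paired t={t:.2f}, p={p:.3f}, r(A,B)={r:.2f}")
```

A non-significant mean difference together with a high alternate-forms correlation is the pattern that would support the comparability claim made in the abstract.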
“…At that time, such tools were specific software packages (e.g., IGOR) that allowed psychometricians to create AIG algorithms (Mortimer et al., 2012). The most common strategy was model-based AIG, which consists of creating an item template with certain modifiable elements (the part that can be automatically generated), based on weak or strong cognitive models involved in the response process (Circi et al., 2023). Since this method is hard to apply to non-cognitive items, AIG has focused on test formats such as multiple-choice questions that measure specific knowledge and skills (Gierl & Lai, 2018; Gierl et al., 2008; Pugh et al., 2016) or intelligence (K. Wang & Su, 2015).…”
Section: Wondering: LLM as an Item Generator
The irruption of Large Language Models (LLMs) into our daily lives has opened up an intriguing future for the course of psychometrics. We give a glimpse of that future through our seven “wonderings” of LLMs: a series of wanderings on how current LLMs can instill wonder in researchers and professionals by assisting them in each step of the design, refinement, and analysis of psychometric tools. Using GPT-4 as an illustration, we have tried to answer what the capabilities of LLMs are as item designers and format generators, as reviewers and respondents, and as data analysts and results interpreters. We interacted with the LLM using a systematic prompt scheme, highlighting the peaks and pitfalls of its responses when addressing psychometric tasks. Finally, we provide some thoughts and guidelines about the validity of the uses that LLM responses can offer, and how to study and perform such a validation process.
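The abstract above describes interacting with GPT-4 through a systematic prompt scheme for item design. The sketch below shows how one such prompt could be issued through the OpenAI Python client; the prompt wording, model choice, and parameters are assumptions for illustration, not the authors' actual scheme.

```python
from openai import OpenAI

# Minimal sketch of a systematic prompting scheme for LLM-assisted item design.
# The prompt wording, model choice, and parameters are illustrative assumptions,
# not the scheme used in the cited study. Requires OPENAI_API_KEY to be set.
client = OpenAI()

PROMPT_TEMPLATE = (
    "You are an item writer for a university-level {subject} exam.\n"
    "Write one type A multiple-choice question on the topic: {topic}.\n"
    "Target difficulty: {difficulty}. Provide a stem, four options (A-D),\n"
    "mark the correct answer, and briefly justify each distractor."
)

def generate_item(subject: str, topic: str, difficulty: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            subject=subject, topic=topic, difficulty=difficulty)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_item("physiology", "renal acid-base regulation", "moderate"))
```

Holding the template fixed and varying only the slots is what makes the prompting "systematic": responses across runs can then be reviewed and psychometrically analyzed in the same way as the template-generated items discussed earlier.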