Abstract: This mini review summarizes the current state of knowledge about automatic item generation in the context of educational assessment and discusses key points in the item generation pipeline. Assessment is critical in all learning systems, and digitalized assessments have grown significantly over the last decade. This creates an urgent need to generate more items quickly and efficiently. Continuous improvements in computational power and advancements in methodological approaches, specifically in the …
“…Using AI and AI-driven applications, tools, and techniques can reduce the challenges that traditional methods present in item construction and item stability [23]. Through AI, it becomes easier and more feasible to construct and update items (questions) and to build item banks.…”
Background
The introduction of competency-based education models and student-centered learning, together with the increased use of formative assessment, has led to demands for high-quality test items. This study aimed to assess the use of an AI tool to generate type A multiple-choice questions (MCQs) and to evaluate their quality.
Methods
The study was a cross-sectional analytical study conducted from June 2023 to August 2023, using a formative team-based learning (TBL) session. The AI tool ChatPdf.com was selected to generate type A MCQs. The generated items were evaluated using a questionnaire completed by subject experts and an item (psychometric) analysis. The expert questionnaire addressed item quality and a rating of item difficulty.
Results
The number of staff members recruited as experts was 25, and the questionnaire response rate was 68%. The quality of the items ranged from good to excellent. None of the items included scenarios or vignettes; all were direct questions. According to the experts' ratings, 80% of the items were easy and only two (20%) were of moderate difficulty; of these two, only one had a matching difficulty index. Forty-eight students participated in the TBL session. The mean mark was 4.8 ± 1.7 out of 10, and the KR-20 was 0.68. Most items were moderately difficult (90%) and only one was difficult (10%). The discrimination index ranged from 0.15 to 0.77: five items (50%) showed excellent discrimination, three (30%) showed good discrimination, one (10%) was poor, and one was non-discriminating. Functional distractors numbered 26 (86.7%) and non-functional distractors four (13.3%). In the distractor analysis, 60% of the items were rated excellent and 40% good. The correlation between the difficulty and discrimination indices was weak and not statistically significant (r = 0.30, p = 0.4).
Conclusion
Items constructed using AI had good psychometric properties and quality, measuring higher-order domains. AI allows the construction of many items within a short time. We hope this paper brings the use of AI in item generation and the associated challenges into a multi-layered discussion that will eventually lead to improvements in item generation and assessment in general.
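The Results above rest on classical item-analysis statistics: the difficulty index (proportion correct), the discrimination index (upper-lower group comparison), and KR-20 reliability. The following is a minimal sketch of how these indices are typically computed from a dichotomously scored response matrix; the data and variable names are illustrative assumptions, not the study's data.

```python
import numpy as np

# Minimal sketch: classical item analysis on a 0/1 scored response matrix
# (rows = students, columns = items). The response data are illustrative only.
responses = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
])

n_students, n_items = responses.shape
totals = responses.sum(axis=1)

# Difficulty index: proportion of students answering each item correctly.
difficulty = responses.mean(axis=0)

# Discrimination index: difference in proportion correct between the
# top and bottom 27% of students ranked by total score.
order = np.argsort(totals)
k = max(1, int(round(0.27 * n_students)))
low, high = responses[order[:k]], responses[order[-k:]]
discrimination = high.mean(axis=0) - low.mean(axis=0)

# KR-20 reliability for dichotomous items.
p = difficulty
q = 1 - p
var_total = totals.var(ddof=1)
kr20 = (n_items / (n_items - 1)) * (1 - (p * q).sum() / var_total)

print("difficulty:", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
print("KR-20:", round(kr20, 2))
```

With the study's 10 items and 48 students, the same computation would yield the difficulty, discrimination, and KR-20 values reported in the Results.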
“…Many items can be generated for a specific topic based on a single cognitive model (Gierl et al., 2012), and such models are standard in measurement theory, allowing the developed tests to serve the assessment purposes of validity, reliability, fairness, and quality. Moreover, AIG is known to make test and assessment development easier by making it quicker to create items, reducing the cost of item creation, helping to continuously and rapidly develop a large pool of items, and tailoring items to fit individual learning needs for better outcomes (Circi et al., 2023).…”
Section: AI Item Generation
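The snippet above describes model-based AIG: a single item model (template) with modifiable elements is instantiated many times to produce a pool of items. A minimal sketch of that idea is shown below; the template, value ranges, and distractor rules are illustrative assumptions, not taken from any cited system.

```python
import itertools
import random

# Minimal sketch of model-based automatic item generation: one item template
# ("item model") with modifiable elements is instantiated into many items.
# Template, value ranges, and distractor rules are illustrative assumptions.

TEMPLATE = ("A patient weighing {weight} kg requires a drug dosed at "
            "{dose} mg/kg. What total dose should be administered?")

def make_item(weight, dose):
    correct = weight * dose
    # Distractors derived from plausible calculation errors.
    distractors = {weight + dose, correct / 2, correct * 2}
    distractors.discard(correct)
    options = [correct] + sorted(distractors)[:3]
    random.shuffle(options)
    return {
        "stem": TEMPLATE.format(weight=weight, dose=dose),
        "options": [f"{o:g} mg" for o in options],
        "key": f"{correct:g} mg",
    }

# Generate an item pool from the cross product of the modifiable elements.
pool = [make_item(w, d) for w, d in itertools.product((50, 60, 70, 80), (2, 5, 10))]
print(len(pool), "items generated; example:", pool[0]["stem"])
```

A single cognitive model here yields twelve items; richer models add constraints on which element combinations are valid and on how distractors map onto known misconceptions.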
“…In educational settings, there has been a growing demand for the rapid generation of assessment items to accommodate continuous testing requirements (Kurdi et al., 2019). This shift has posed challenges to traditional test item creation methods and to the maintenance of test item bank stability (Circi et al., 2023). Finding high-quality test items has consistently proven difficult, with the manual creation of items being time-consuming and costly (Gehringer, 2004).…”
Section: Introduction
“…Many teachers encounter difficulties designing quality items for each assessment, often resorting to item reuse across terms (Gehringer, 2004; Wellberg, 2023). However, this practice may lead to issues such as students memorizing answers without engaging with the content and the risk of cheating through item over-exposure (Circi et al., 2023; Gehringer, 2004).…”
This study examines the efficacy of artificial intelligence (AI) in creating parallel test items compared to human-made ones. Two test forms were developed: one consisting of 20 existing human-made items and another with 20 new items generated with ChatGPT assistance. Expert reviews confirmed the content parallelism of the two test forms. Forty-three university students then completed the 40 test items, presented in random order from both forms, on a final test. Statistical analyses of student performance indicated comparability between the AI-assisted and the human-made test forms. Despite limitations such as sample size and reliance on classical test theory (CTT), the findings suggest ChatGPT’s potential to assist teachers in test item creation, reducing workload and saving time. These results highlight ChatGPT’s value in educational assessment and emphasize the need for further research and development in this area.
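Because each student in the abstract above answered both forms, a CTT-style comparison of the two forms reduces to a paired contrast of form scores plus an alternate-forms correlation. The following is a minimal sketch of such a comparison under those assumptions; the scores are simulated placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

# Minimal sketch of a CTT-style parallel-forms comparison. Each student has a
# score on the human-made form (A) and on the AI-assisted form (B). The scores
# below are simulated placeholders, not data from the cited study.
rng = np.random.default_rng(0)
form_a = np.clip(rng.normal(14, 3, size=43).round(), 0, 20)           # human-made form, 20 items
form_b = np.clip(form_a + rng.normal(0, 2, size=43).round(), 0, 20)   # AI-assisted form, 20 items

# Paired comparison: do mean scores differ between the two forms?
t, p = stats.ttest_rel(form_a, form_b)

# Alternate-forms consistency: correlation between the two scores.
r, _ = stats.pearsonr(form_a, form_b)

print(f"mean A={form_a.mean():.1f}, mean B={form_b.mean():.1f}")
print(f"paired t={t:.2f}, p={p:.3f}, r(A,B)={r:.2f}")
```

A non-significant mean difference together with a high alternate-forms correlation is the pattern that would support the comparability claim made in the abstract.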
“…At that time, such tools were specific software packages (e.g., IGOR) that allowed psychometricians to create AIG algorithms (Mortimer et al., 2012). The most common strategy was model-based AIG, which consists of creating an item template with certain modifiable elements (the part that can be automatically generated), based on weak or strong cognitive models involved in the response process (Circi et al., 2023). Since this method is hard to apply to non-cognitive items, AIG has focused on test formats such as multiple-choice questions that measure specific knowledge and skills (Gierl & Lai, 2018; Gierl et al., 2008; Pugh et al., 2016) or intelligence (K. Wang & Su, 2015).…”
Section: Wondering: LLM as an Item Generator
The irruption of Large Language Models (LLMs) into our daily lives has opened up an intriguing future for the course of psychometrics. We give a glimpse of that future through our seven “wonderings” of LLMs: a series of wanderings on how current LLMs can instill wonder in researchers and professionals by assisting them in each step of the design, refinement, and analysis of psychometric tools. Using GPT-4 as an illustration, we have tried to answer what the capabilities of LLMs are as item designers and format generators, as reviewers and respondents, and as data analysts and results interpreters. We interacted with the LLM using a systematic prompt scheme, highlighting the peaks and pitfalls of its responses when addressing psychometric tasks. Finally, we provide some thoughts and guidelines about the validity of the uses that LLM responses can offer, and how to study and perform such a validation process.
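The abstract above describes interacting with GPT-4 through a systematic prompt scheme for item design. The sketch below shows how one such prompt could be issued through the OpenAI Python client; the prompt wording, model choice, and parameters are assumptions for illustration, not the authors' actual scheme.

```python
from openai import OpenAI

# Minimal sketch of a systematic prompting scheme for LLM-assisted item design.
# The prompt wording, model choice, and parameters are illustrative assumptions,
# not the scheme used in the cited study. Requires OPENAI_API_KEY to be set.
client = OpenAI()

PROMPT_TEMPLATE = (
    "You are an item writer for a university-level {subject} exam.\n"
    "Write one type A multiple-choice question on the topic: {topic}.\n"
    "Target difficulty: {difficulty}. Provide a stem, four options (A-D),\n"
    "mark the correct answer, and briefly justify each distractor."
)

def generate_item(subject: str, topic: str, difficulty: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            subject=subject, topic=topic, difficulty=difficulty)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_item("physiology", "renal acid-base regulation", "moderate"))
```

Holding the template fixed and varying only the slots is what makes the prompting "systematic": responses across runs can then be reviewed and psychometrically analyzed in the same way as the template-generated items discussed earlier.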