Feasibility assurance: a review of automatic item generation in medical assessment

Falcão, Filipe; Costa, Patrício; Pêgo, José M.

doi:10.1007/s10459-022-10092-z

Cited by 13 publications

(4 citation statements)

References 33 publications

(95 reference statements)

“…A fundamental issue for AIG research regards the validation of its processes (Gierl et al, 2022b; Shin, 2021). To date, few research has been devoted to collecting evidence to support the validity of AIG (Falcão et al, 2022; Gierl et al, 2022b; Rafatbakhsh et al, 2020). Since incorporating cognitive models into test design and development is required to support validity arguments for test-based inferences, we have reason to suppose that AIG incorporates validity evidence in its methods (Gierl et al, 2022b; Leighton & Gierl, 2011).…”

Section: AIG Versus Manual Item Writing (mentioning)

confidence: 99%

A suggestive approach for assessing item quality, usability and validity of Automatic Item Generation

Falcão, Pereira, Gonçalves et al. 2023

Adv in Health Sci Educ

Self Cite

Automatic Item Generation (AIG) refers to the process of using cognitive models to generate test items using computer modules. It is a new but rapidly evolving research area where cognitive and psychometric theory are combined into a digital framework. However, the item quality, usability and validity of AIG relative to traditional item development methods remain unclear. This paper takes a top-down strong theory approach to evaluate AIG in medical education. Two studies were conducted. In Study I, participants with different levels of clinical knowledge and item writing experience developed medical test items both manually and through AIG, and the two item types were compared in terms of quality and usability (efficiency and learnability). In Study II, automatically generated items were included in a summative exam in the content area of surgery, and a psychometric analysis based on Item Response Theory inspected the validity and quality of the AIG items. Items generated by AIG demonstrated quality and evidence of validity, and were adequate for testing students' knowledge. The time spent developing the contents for item generation (cognitive models) and the number of items generated did not vary with the participants' item writing experience or clinical knowledge. AIG produces numerous high-quality items in a fast, economical and easy-to-learn process, even for item writers who are inexperienced or lack clinical training. Medical schools may benefit from a substantial improvement in cost-efficiency in developing test items by using AIG. Item writing flaws can be significantly reduced thanks to the application of AIG's models, thus generating test items capable of accurately gauging students' knowledge.

“…In his seminal work, Falcão clearly delimitates scoring as the procedures uses to develop AIG, generalization as the difficulty measured in the test and extrapolation as the discrimination of the items. Therefore, we examine previous efforts to gather validity inferences on chatbots developed MCQs on the lens of Kane Framework proposed by (Cook et al, 2015), and Falcão (Falcão et al, 2022). A cross-sectional study used ChatGPT, Google Bard and Microsoft Bing to develop MCQs for a physiology course, in this study a careful blueprint was mapped by two content experts; inferences of scoring and generalization were collected.…”

Section: Literature Review (mentioning)

confidence: 99%

“…Kane proposes four inferences: 1) Scoring, which is marked by the construction of an item in terms of its administration, ranging from the format of the test (i.e., multiple-choice-questions, skills evaluation) to the procedures planned to administer the test (i.e., training of raters, facilities needed); 2) Generalization, refers to the degree in which what is assessed (i.e., ten multiple-choice-questions based on the cardiology module) represent what should be assessed (i.e., the material of the cardiology module), this process may be aided by using a test blueprint or using reliability indices; 3) Extrapolation, is the relation between the test performance and real-world performance, this inference requires that the test theoretically reflects real-world performance (i.e., evaluate the test with content experts) or empirically (i.e., identifying the correlation between the test and workplace assessments); and 4) Implications, which measures real-world impact of the assessment using a cost-effectiveness approach. To further understand the application for Kane validity framework, a recent review conducted on the use of Automatic Item Generation (AIG) may be adequate (Falcão et al, 2022). In his seminal work, Falcão clearly delimitates scoring as the procedures uses to develop AIG, generalization as the difficulty measured in the test and extrapolation as the discrimination of the items.…”

Section: Literature Review (mentioning)

confidence: 99%

“…However, developing high-quality MCQs requires human expertise, time, and money, which is often lacking in some educational settings (Gierl & Haladyna, 2013). Automatic Item Generation (AIG) has been used to surpass these limitations to develop several MCQs (Falcão et al, 2022). However, this approach requires content experts and specific technical knowledge to implement (Gierl et al, 2012).…”

Section: Introduction (mentioning)

confidence: 99%

Using chatbots to develop multiple-choice questions. We got evidence, but we ain't there yet!

Flores-Cohaila, Calderón, Castro-Blancas et al. 2023

Preprint

Developing accessible assessment tools is crucial for educators. Traditional methods demand significant resources such as time and expertise. Therefore, an accessible, user-friendly approach is needed. Traditional assessment creation faces challenges; however, new solutions like automatic item generation have emerged. Despite their potential, they still require expert knowledge. ChatGPT and similar chatbots offer a novel approach in this field. Our study evaluates the validity of MCQs generated by chatbots under the Kane validity framework. We focused on the top ten topics in Infectious and Tropical diseases, chosen based on epidemiological data and expert evaluations. These topics were transformed into learning objectives for chatbots like GPT-4, BingAI, and Claude to generate MCQs. Each chatbot produced 10 MCQs, which were subsequently refined. We compared 30 chatbot-generated MCQs with 10 from a Peruvian medical examination. The participants included 48 medical students and doctors from Peru. Our analysis revealed that the quality of chatbot-generated MCQs is consistent with those created by humans. This was evident in scoring inferences, with no significant differences in difficulty and discrimination indexes. In conclusion, chatbots appear to be a viable tool for creating MCQs in the field of infectious and tropical diseases in Peru. Although our study confirms their validity, further research is necessary to optimize their use in educational assessments.

ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam

Kıyak, Coşkun, Budakoğlu et al. 2024

Eur J Clin Pharmacol
