2024
DOI: 10.1177/23821205241238641
Can ChatGPT-3.5 Pass a Medical Exam? A Systematic Review of ChatGPT's Performance in Academic Testing

Anusha Sumbal, Ramish Sumbal, Alina Amir

Abstract: OBJECTIVE We, therefore, aim to conduct a systematic review to assess the academic potential of ChatGPT-3.5, along with its strengths and limitations when taking medical exams. METHOD Following PRISMA guidelines, a systematic search of the literature was performed using the electronic databases PubMed/MEDLINE, Google Scholar, and Cochrane. Articles from their inception until April 4, 2023, were queried. A formal narrative analysis was conducted by systematically arranging similarities and differences between individu…

Cited by 12 publications (5 citation statements)
References 29 publications
“…An extensive body of literature has found that LLMs, such as ChatGPT, can successfully pass medical examinations [28], albeit with varying degrees of heterogeneity and variability [29], exhibiting strong abilities in explanation, reasoning, memory, and accuracy. On the other hand, LLMs struggle with image-based questions [30] and, in some circumstances, lack insight and critical thinking skills [31]. Some of the studies that exploit quizzes/vignettes/validated knowledge surveys [32,33] have quantified the fluency and accuracy of AI-based tools using validated and reliable instruments, like the “Artificial Intelligence Performance Instrument” (AIPI) [32].…”
Section: The Quiz/Vignette/Knowledge Survey Paradigm
confidence: 99%
“…An extensive body of literature has found that LLMs such as ChatGPT can successfully pass medical examinations [28], although with varying degrees of heterogeneity and variability [29], exhibiting strong abilities in explanation, reasoning, memory, and accuracy. On the other hand, LLMs struggle with image-based questions [30] and, in some circumstances, lack insight and critical thinking skills [31].…”
Section: Implementing “Verification Paradigms”: A Comprehensive Evalu…
confidence: 99%
“…To the best of our knowledge, three systematic reviews have explored ChatGPT's performance in medical licensing exams [58][59][60].…”
Section: Literature Review
confidence: 99%
“…A study from Pakistan collected literature up to April 2023, focusing on the performance of GPT-3.5 in various medical licensing exams worldwide [59]. However, with the advent of the more advanced GPT-4, more studies have focused on GPT-4.…”
Section: Literature Review
confidence: 99%
“…[6][7][8][9][10][11][12][13] As a more objective and generalizable benchmark for performance, studies have also explored LLMs' impressive performance on standardized clinical examinations such as the United States Medical Licensing Examination (USMLE) and specialty examinations. [14][15][16][17][18][19][20][21] Advancements in the capabilities of LLMs, such as image recognition, have opened new avenues for innovation and research into their potential applications in clinical care. Furthermore, model prompting strategies, such as prompt engineering, few-shot learning, and retrieval augmented generation (RAG), have shown promise in enhancing the performance of generalist foundation models on science and general medical knowledge benchmarks.…”
Section: Introduction
confidence: 99%