Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment

Lai, U Hin; Wu, Keng Sam; Hsu, Ting-Yu; Kan, Jessie Kai Ching

doi:10.3389/fmed.2023.1240915

Cited by 16 publications

(17 citation statements)

References 32 publications

“…Variability in ChatGPT performance across varying disciplines was shown in previous studies as follows. A recent study by Lai et al showed that ChatGPT-4 had an average score of 76.3% in the United Kingdom Medical Licensing Assessment, a national undergraduate medical exit exam (Lai et al, 2023). Importantly, the study revealed varied performance across medical specialties, with weaker results in gastrointestinal/hepatology, endocrine/metabolic, and clinical hematology domains as opposed to better performance in the mental health, cancer, and cardiovascular domains (Lai et al, 2023).…”

Section: Discussion (mentioning)

confidence: 99%

“…A recent study by Lai et al showed that ChatGPT-4 had an average score of 76.3% in the United Kingdom Medical Licensing Assessment, a national undergraduate medical exit exam (Lai et al, 2023). Importantly, the study revealed varied performance across medical specialties, with weaker results in gastrointestinal/hepatology, endocrine/metabolic, and clinical hematology domains as opposed to better performance in the mental health, cancer, and cardiovascular domains (Lai et al, 2023). Additionally, a similar discrepancy in ChatGPT-4 performance across medical subjects (albeit lacking statistical significance) was noticed in a study by Gobira et al which utilized the 2022 Brazilian National Examination for Medical Degree Revalidation, with worse performance in preventive medicine (Gobira et al, 2023).…”

Section: Discussion (mentioning)

confidence: 99%

Below average ChatGPT performance in medical microbiology exam compared to university students

Sallam, Al-Salahat (2023), Front. Educ.

Background: The transformative potential of artificial intelligence (AI) in higher education is evident, with conversational models like ChatGPT poised to reshape teaching and assessment methods. The rapid evolution of AI models requires a continuous evaluation. AI-based models can offer personalized learning experiences but raises accuracy concerns. MCQs are widely used for competency assessment. The aim of this study was to evaluate ChatGPT performance in medical microbiology MCQs compared to the students’ performance. Methods: The study employed an 80-MCQ dataset from a 2021 medical microbiology exam at the University of Jordan Doctor of Dental Surgery (DDS) Medical Microbiology 2 course. The exam contained 40 midterm and 40 final MCQs, authored by a single instructor without copyright issues. The MCQs were categorized based on the revised Bloom’s Taxonomy into four categories: Remember, Understand, Analyze, or Evaluate. Metrics, including facility index and discriminative efficiency, were derived from 153 midterm and 154 final exam DDS student performances. ChatGPT 3.5 was used to answer questions, and responses were assessed for correctness and clarity by two independent raters. Results: ChatGPT 3.5 correctly answered 64 out of 80 medical microbiology MCQs (80%) but scored below the student average (80.5/100 vs. 86.21/100). Incorrect ChatGPT responses were more common in MCQs with longer choices (p = 0.025). ChatGPT 3.5 performance varied across cognitive domains: Remember (88.5% correct), Understand (82.4% correct), Analyze (75% correct), Evaluate (72% correct), with no statistically significant differences (p = 0.492). Correct ChatGPT responses received statistically significant higher average clarity and correctness scores compared to incorrect responses. Conclusion: The study findings emphasized the need for ongoing refinement and evaluation of ChatGPT performance. ChatGPT 3.5 showed the potential to correctly and clearly answer medical microbiology MCQs; nevertheless, its performance was below-bar compared to the students. Variability in ChatGPT performance in different cognitive domains should be considered in future studies. The study insights could contribute to the ongoing evaluation of the AI-based models’ role in educational assessment and to augment the traditional methods in higher education.

“…[19] ChatGPT performed well (76.3%) on the UKMLA [20]. In contrast, the performance of ChatGPT vs dental students on a medical microbiology MCQ exam found that ChatGPT 3.5 correctly answered 64 out of 80 MCQs (80%), scoring 80.5 out of 100 which was below the student average of 86.21 out of 100 [29].…”

Section: Discussion (mentioning)

confidence: 93%

Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom’s Taxonomy

Bharatha, Ojeh, Fazle Rabbi et al. (2024), AMEP

Introduction: This research investigated the capabilities of ChatGPT-4 compared to medical students in answering MCQs using the revised Bloom’s Taxonomy as a benchmark. Methods: A cross-sectional study was conducted at The University of the West Indies, Barbados. ChatGPT-4 and medical students were assessed on MCQs from various medical courses using computer-based testing. Results: The study included 304 MCQs. Students demonstrated good knowledge, with 78% correctly answering at least 90% of the questions. However, ChatGPT-4 achieved a higher overall score (73.7%) compared to students (66.7%). Course type significantly affected ChatGPT-4’s performance, but revised Bloom’s Taxonomy levels did not. A detailed association check between program levels and Bloom’s taxonomy levels for correct answers by ChatGPT-4 showed a highly significant correlation (p<0.001), reflecting a concentration of “remember-level” questions in preclinical and “evaluate-level” questions in clinical courses. Discussion: The study highlights ChatGPT-4’s proficiency in standardized tests but indicates limitations in clinical reasoning and practical skills. This performance discrepancy suggests that the effectiveness of artificial intelligence (AI) varies based on course content. Conclusion: While ChatGPT-4 shows promise as an educational tool, its role should be supplementary, with strategic integration into medical education to leverage its strengths and address limitations. Further research is needed to explore AI’s impact on medical education and student performance across educational levels and courses.

“…Extensive research has shown that ChatGPT, particularly its most recent version GPT-4, excels across various standardized tests. This includes the United States Medical Licensing Examination [22, 23, 24, 25]; medical licensing tests from different countries [26, 27, 28, 29, 30]; and exams related to specific fields such as psychiatry [31], nursing [32], dentistry [33], pathology [34], pharmacy [35], urology [36], gastroenterology [37], parasitology [38], and ophthalmology [39]. Additionally, there is evidence of ChatGPT’s ability to create discharge summaries and operative reports [40, 41], record patient histories of present illness [42], and enhance the documentation process for informed consent [43], although its effectiveness requires further improvement.…”

Section: Introduction (mentioning)

confidence: 99%

Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications

Miao, Thongprayoon, Suppadungsuk et al. (2024), Medicina

The integration of large language models (LLMs) into healthcare, particularly in nephrology, represents a significant advancement in applying advanced technology to patient care, medical research, and education. These advanced models have progressed from simple text processors to tools capable of deep language understanding, offering innovative ways to handle health-related data, thus improving medical practice efficiency and effectiveness. A significant challenge in medical applications of LLMs is their imperfect accuracy and/or tendency to produce hallucinations—outputs that are factually incorrect or irrelevant. This issue is particularly critical in healthcare, where precision is essential, as inaccuracies can undermine the reliability of these models in crucial decision-making processes. To overcome these challenges, various strategies have been developed. One such strategy is prompt engineering, like the chain-of-thought approach, which directs LLMs towards more accurate responses by breaking down the problem into intermediate steps or reasoning sequences. Another one is the retrieval-augmented generation (RAG) strategy, which helps address hallucinations by integrating external data, enhancing output accuracy and relevance. Hence, RAG is favored for tasks requiring up-to-date, comprehensive information, such as in clinical decision making or educational applications. In this article, we showcase the creation of a specialized ChatGPT model integrated with a RAG system, tailored to align with the KDIGO 2023 guidelines for chronic kidney disease. This example demonstrates its potential in providing specialized, accurate medical advice, marking a step towards more reliable and efficient nephrology practices.
