Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study

Takagi, Soshi; Watari, Takashi; Erabi, Ayano; Sakaguchi, Kota

doi:10.2196/48002

Cited by 122 publications

(97 citation statements)

References 23 publications

“…Recent studies show GPT-4 outperformed GPT-3.5 by 24%–30% in various medical examinations. 13,14,21,23 These findings indicate a significant enhancement in the model's capabilities. However, a study using the American College of Gastroenterology Test found GPT-3.5 and GPT-4 had scores of 65% and 62%, respectively.…”

Section: Discussion (mentioning)

confidence: 90%

“…Yet, the AI model struggles with more complex tasks requiring advanced comprehension, analytical abilities, and precise calculations. As indicated by a number of studies, 16,20-22 ChatGPT's limitations in handling scientific and mathematical applications, particularly those demanding high-level cognitive engagement, become evident. Fluctuations in accuracy may be linked to the nature of subfield questions, even without explicit categorization.…”

Section: Discussion (mentioning)

confidence: 99%

“…Recently, Google and DeepMind introduced their large language model, named PaLM 2, alongside a medical domain-specific version. 24,25 However, it is important to note that even with these improvements, the accuracy of GPT-4 remains below 80% and falls short of the average scores achieved by medical students or examinees. 14,21 We also obtained relevant data regarding the performance of nephrologists and/or nephrology trainees on the Kidney Self-Assessment Program and Nephrology Self-Assessment Program tests from ASN. The Kidney Self-Assessment Program accuracy rate was 80%, obviously higher than the passing threshold.…”

Section: Discussion (mentioning)

confidence: 99%

Performance of ChatGPT on Nephrology Test Questions

Miao, Thongprayoon, Garcia Valencia, et al. 2023. CJASN

Background: ChatGPT is a novel tool that allows people to engage in conversations with an advanced machine learning model. ChatGPT's performance in the United States Medical Licensing Examination is comparable to a successful candidate’s performance. However, its performance in the nephrology field remains undetermined. This study assessed ChatGPT's capabilities in answering nephrology test questions. Methods: Questions sourced from the Nephrology Self-Assessment Program and the Kidney Self-Assessment Program were used, each consisting of multiple-choice, single-answer questions. Questions containing visual elements were excluded. Each question bank was run twice using GPT-3.5 and GPT-4. Total accuracy rate, defined as the percentage of correct answers obtained by ChatGPT in either the first or second run, and the total concordance, defined as the percentage of identical answers provided by ChatGPT during both runs, regardless of their correctness, were used to assess its performance. Results: A comprehensive assessment was conducted on a set of 975 questions, comprising 508 questions from the Nephrology Self-Assessment Program and 467 from the Kidney Self-Assessment Program. GPT-3.5 resulted in a total accuracy rate of 51%. Notably, the Nephrology Self-Assessment Program questions yielded a higher accuracy rate than the Kidney Self-Assessment Program questions (58% vs. 44%; p<0.001). The total concordance rate across all questions was 78%, with correct answers exhibiting a higher concordance rate (84%) compared to incorrect answers (73%) (p<0.001). When examining various nephrology subfields, the total accuracy rates were relatively lower in electrolyte and acid-base disorder, glomerular disease, and kidney-related bone and stone disorders. The total accuracy rate of GPT-4’s response was 74%, higher than GPT-3.5 (p<0.001) but remained below the passing threshold and average scores of nephrology examinees (77%).
Conclusions: ChatGPT exhibited limitations regarding accuracy and repeatability when addressing nephrology-related questions. Variations in performance were evident across various subfields.
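
The two metrics defined in the Methods above can be illustrated with a minimal sketch. It follows one literal reading of the abstract (a question counts toward total accuracy if either run answered it correctly; it counts toward concordance if both runs gave the same answer, right or wrong). The function name, argument names, and toy data are hypothetical and are not taken from the study.

```python
def accuracy_and_concordance(run1, run2, answer_key):
    """Return (total_accuracy, total_concordance) as fractions in [0, 1].

    Assumed reading of the abstract, not the authors' code:
    - total accuracy: share of questions answered correctly in run 1 OR run 2
    - total concordance: share of questions with identical answers in both runs
    """
    if not (len(run1) == len(run2) == len(answer_key)):
        raise ValueError("all three sequences must have the same length")
    n = len(answer_key)
    if n == 0:
        raise ValueError("at least one question is required")

    correct_either = sum(
        1 for a, b, k in zip(run1, run2, answer_key) if a == k or b == k
    )
    identical = sum(1 for a, b in zip(run1, run2) if a == b)
    return correct_either / n, identical / n


# Toy data (hypothetical): four questions, two runs.
key = ["A", "B", "C", "D"]
first_run = ["A", "B", "D", "C"]
second_run = ["A", "C", "D", "D"]
accuracy, concordance = accuracy_and_concordance(first_run, second_run, key)
print(accuracy, concordance)
```

Note that under this reading the third toy question is concordant but wrong in both runs, which is why the abstract reports concordance separately for correct and incorrect answers.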

“…GPT can also understand languages other than English. The latest model, GPT-4, has been reported to achieve passing scores in medical licensing examinations in non-English speaking countries such as Japan, China, Poland, and Peru [8][9][10][11][12][13].…”

Section: Introduction (mentioning)

confidence: 99%

Capability of GPT-4V(ision) in Japanese National Medical Licensing Examination

Nakao, Miki, Nakamura, et al. 2023. Preprint

Background: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images. Objective: To evaluate the capability of GPT-4V, a recent multimodal LLM developed by OpenAI, in recognizing images in the medical field by testing its capability to answer questions in the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that had one or more images as part of a question and presented GPT-4V with the same questions under two conditions: 1) with both the question text and associated image(s), and 2) with the question text only. We then compared the difference in accuracy between the two conditions using the exact McNemar’s test. Results: Among the 108 questions with images, GPT-4V’s accuracy was 68% when presented with images and 72% when presented without images (P = .36). Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese Medical Licensing Examination.
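
For readers unfamiliar with the exact McNemar's test named in the Methods, the sketch below shows the standard two-sided form: only the discordant pairs matter (questions answered correctly under one condition but not the other), and the p-value is a binomial tail probability with success probability 0.5. The abstract does not report the discordant counts, so the numbers in the usage lines are hypothetical and the output is not meant to reproduce the reported P value. The function name is also an assumption.

```python
from math import comb


def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: pairs correct under condition 1 only (e.g. with images)
    c: pairs correct under condition 2 only (e.g. text only)
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as uneven as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test and cap at 1.
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(exact_mcnemar_p(0, 5))
print(exact_mcnemar_p(5, 5))
```

Concordant pairs (both conditions right, or both wrong) do not enter the calculation, which is why the test suits a paired design such as the same questions asked with and without images.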

“…Dozens of articles followed in a short time, focusing on the national medical licensing examinations of various countries and the board examinations of various specialties. 8,9…”

mentioning

confidence: 99%