2023
DOI: 10.2196/48023

Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study

Yasutaka Yanagita,
Daiki Yokokawa,
Shun Uchida
et al.

Abstract: Background ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers, as stated by OpenAI as a limitation. However, considering that ChatGPT is an interactive AI that has been trained to reduce the output of unethical sentences, the reliability of the training data is high and the usefulness of the output content is promising. Fortunately, in March 2023, a new version of ChatGPT…


Cited by 35 publications (29 citation statements)
References 16 publications (18 reference statements)
“…6 Other studies have used more complex questions and clinical scenarios and reported accuracy rates ranging from 26.7% to 81.5% for the chatbot, depending on the specific methods employed or the version that was tested. 7,15,16 In common, the studies that compared the two versions of the chatbot have consistently shown superior performance for version 4. 7,15 In our study, open questions, with different levels of complexity, were posed to ChatGPT.…”
Section: Discussion
confidence: 99%
“…7,15,16 In common, the studies that compared the two versions of the chatbot have consistently shown superior performance for version 4. 7,15 In our study, open questions, with different levels of complexity, were posed to ChatGPT. We have not prompted the chatbot to base its answers on specific guidelines since we wanted to simulate "real world" scenarios, where a user, not necessarily an expert in…” [Figure 1 caption from the citing article: Performance of ChatGPT 3.5 and 4 in conceptual and case-based questions.]
Section: Discussion
confidence: 99%