Capability of GPT-4V(ision) in Japanese National Medical Licensing Examination

Nakao, Takahiro; Miki, Soichiro; Nakamura, Yuta; Kikuchi, Tomohiro; Nomura, Yukihiro; Hanaoka, Shouhei; Yoshikawa, Takeharu; Abe, Osamu

doi:10.1101/2023.11.07.23298133

Cited by 6 publications

(3 citation statements)

References 19 publications


“…A total of 557 case reports were identified. The exclusion criteria were carefully chosen based on previous studies for CDSSs [ 32 ] and ChatGPT-4V [ 28 ] to ensure the focus remained on diagnostically challenging adult cases with relevant image data. Specifically, cases were excluded for the following reasons: nondiagnosis (130 cases), patients younger than 10 years (35 cases), and the absence of image data (29 cases).…”

Section: Methods (mentioning)

confidence: 99%

“…Preliminary studies in various fields, including medicine [ 26 - 28 ] and others [ 29 - 31 ] have shown the effectiveness of ChatGPT-4V. Some of these studies have highlighted its efficacy in interpreting medical images [ 26 , 28 ], though they were limited in scope. However, clinical image data includes a wide range of elements, from physical examinations to various investigation results.…”

Section: Introduction (mentioning)

confidence: 99%


Evaluating ChatGPT-4’s Diagnostic Accuracy: Impact of Visual Data Integration

Hirosawa, Harada, Tokumasu et al. 2024

JMIR Med Inform


Background: In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear.

Objective: This study aims to assess the impact of adding image data on ChatGPT-4’s diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data.

Methods: We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis.

Results: The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V’s performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ² test). Additionally, ChatGPT-4’s self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases.

Conclusions: Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.
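
The two P values in the Results above can be roughly checked from the reported counts alone. The sketch below is not the authors' analysis: it assumes a 2×2 Pearson chi-square test with Yates continuity correction (the abstract says only "χ2 test" and does not state whether a correction was applied), and the function name `yates_chi2_p` is made up for illustration.

```python
import math

def yates_chi2_p(hits_a, hits_b, n_per_group):
    """P value of a 2x2 chi-square test with Yates correction (df = 1).

    hits_a, hits_b: number of correct cases in each arm.
    n_per_group: number of cases per arm (363 in the abstract).
    """
    a, b = hits_a, n_per_group - hits_a   # arm A: correct, incorrect
    c, d = hits_b, n_per_group - hits_b   # arm B: correct, incorrect
    n = a + b + c + d
    # Yates-corrected statistic: N * (|ad - bc| - N/2)^2 / product of margins
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Survival function of chi-square with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

# Counts taken from the abstract: ChatGPT-4V vs text-only ChatGPT-4
p_top10 = yates_chi2_p(309, 319, 363)  # diagnosis within top 10
p_top1 = yates_chi2_p(161, 203, 363)   # diagnosis ranked first

print(f"top-10 inclusion: P = {p_top10:.3f}")
print(f"top diagnosis:    P = {p_top1:.4f}")
```

Under that assumption the two values round to the reported P=.33 and P=.002; without the continuity correction the first value would come out somewhat lower, so the match is suggestive rather than proof of the method used.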


“…Recent studies have further explored the diagnostic application of multimodal LLMs (also called 'vision-language models') that are able to ingest not only text but also image data as input (12)(13)(14)(15)(16)(17)(18)(19)(20). However, several studies demonstrated low performance of Generative Pretrained Transformer 4 Vision (GPT-4V) by OpenAI in differential diagnosis based on various types of radiological images (12,16,18,20,21).…”

Section: Introduction (mentioning)

confidence: 99%

Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4(V) in Challenging Brain MRI Cases

Schramm, Preis, Metz et al. 2024

Preprint


Background: Recent studies have explored the application of multimodal large language models (LLMs) in radiological differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood.

Purpose: To evaluate the impact of varying multimodal input elements on the accuracy of GPT-4(V)-based brain MRI differential diagnosis.

Methods: Thirty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image, image annotation, medical history, image description) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (© PerplexityAI, powered by GPT-4(V)). Accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a chi-square test and a Kruskal-Wallis test. Results were corrected for false discovery rate employing the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each individual input element to diagnostic performance.

Results: The prompt group containing an annotated image, medical history, and image description as input exhibited the highest diagnostic accuracy (67.8% correct responses). Significant differences were observed between prompt groups, especially between groups that contained the image description among their inputs, and those that did not. Regression analyses confirmed a large positive effect of the image description on diagnostic accuracy (p << 0.001), as well as a moderate positive effect of the medical history (p < 0.001). The presence of unannotated or annotated images had only minor or insignificant effects on diagnostic accuracy.

Conclusion: The textual description of radiological image findings was identified as the strongest contributor to performance of GPT-4(V) in brain MRI differential diagnosis, followed by the medical history. The unannotated or annotated image alone yielded very low diagnostic performance. These findings offer guidance on the effective utilization of multimodal LLMs in clinical practice.
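
For readers unfamiliar with the Benjamini-Hochberg procedure named in the Methods above, the following is a minimal sketch of how it adjusts a set of p-values. The input values are hypothetical placeholders, not results from the study, and the function name `benjamini_hochberg` is chosen only for illustration.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing monotonicity with a running minimum
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four pairwise comparisons (illustration only)
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f} -> adjusted p = {q:.3f}")
```

A comparison is then called significant at a false discovery rate of 5% when its adjusted value is at most 0.05.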


Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review

Ullah, Parwani, Baig et al. 2024

Diagn Pathol


Background: The integration of large language models (LLMs) like ChatGPT in diagnostic medicine, with a focus on digital pathology, has garnered significant attention. However, understanding the challenges and barriers associated with the use of LLMs in this context is crucial for their successful implementation.

Methods: A scoping review was conducted to explore the challenges and barriers of using LLMs in diagnostic medicine, with a focus on digital pathology. A comprehensive search was conducted using electronic databases, including PubMed and Google Scholar, for relevant articles published within the past four years. The selected articles were critically analyzed to identify and summarize the challenges and barriers reported in the literature.

Results: The scoping review identified several challenges and barriers associated with the use of LLMs in diagnostic medicine. These included limitations in contextual understanding and interpretability, biases in training data, ethical considerations, impact on healthcare professionals, and regulatory concerns. Contextual understanding and interpretability challenges arise because the models lack a true understanding of medical concepts, are not explicitly trained on medical records selected by trained professionals, and operate as black boxes. Biases in training data pose a risk of perpetuating disparities and inaccuracies in diagnoses. Ethical considerations include patient privacy, data security, and responsible AI use. The integration of LLMs may impact healthcare professionals’ autonomy and decision-making abilities. Regulatory concerns surround the need for guidelines and frameworks to ensure safe and ethical implementation.

Conclusion: The scoping review highlights the challenges and barriers of using LLMs in diagnostic medicine with a focus on digital pathology. Understanding these challenges is essential for addressing the limitations and developing strategies to overcome barriers. It is critical for health professionals to be involved in the selection of data and fine tuning of the models. Further research, validation, and collaboration between AI developers, healthcare professionals, and regulatory bodies are necessary to ensure the responsible and effective integration of LLMs in diagnostic medicine.
