2024
DOI: 10.1016/j.patter.2024.100943

Can large language models reason about medical questions?

Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, et al.
Cited by 19 publications (3 citation statements) | References 32 publications
“…Similar results were obtained in the first input data, as in the work of Liévin et al . (2023), with a 46% accuracy for the algorithm, with zero suggestions, as well as in neurology exam answers [ 7 , 28 ]. This is also a similar result to a recent paper published by Suwała et al .…”
Section: Discussion (mentioning)
confidence: 99%
“…However, while there are valid concerns regarding LLMs' current limitations in handling complex reasoning tasks, there is also accumulating evidence of their improving capabilities [38,56,57]. Within an academic context, Liévin et al [58] conclude that LLMs can effectively answer and reason about medical questions, while a recent survey by Chang et al [30] indicates that LLMs perform well in tasks like arithmetic reasoning and demonstrate marked competence in logical reasoning tasks too, though they encounter significant challenges with abstract and multi-hop reasoning, struggling particularly with tasks requiring complex, novel, or counterfactual thinking. The ability to self-critique is necessary for advanced reasoning that supports rational decision-making and problem-solving, and Luo et al [59] demonstrate the difficulty of achieving this within LLMs; however, they show how an improvement in LLMs' performance on reasoning tasks can be elicited through advanced prompting techniques involving self-critique.…”
Section: Reevaluating the Critiques of the Reasoning Capabilities of ... (mentioning)
confidence: 99%
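The statement above cites Luo et al [59] on eliciting stronger reasoning from LLMs through prompting that includes self-critique. As a rough, model-agnostic sketch only: the `generate` callable and all prompt wording below are assumptions for illustration, not the method of any cited paper.

```python
from typing import Callable


def self_critique_answer(
    question: str,
    generate: Callable[[str], str],  # stand-in for any LLM completion call (assumed)
    rounds: int = 2,
) -> str:
    """Draft an answer, then repeatedly ask the model to critique
    and revise its own draft (a generic self-critique loop)."""
    answer = generate(f"Question: {question}\nAnswer step by step:")
    for _ in range(rounds):
        # Ask the model to find flaws in its own draft.
        critique = generate(
            f"Question: {question}\nProposed answer: {answer}\n"
            "List any factual or logical errors in the proposed answer:"
        )
        # Revise the draft in light of the critique.
        answer = generate(
            f"Question: {question}\nProposed answer: {answer}\n"
            f"Critique: {critique}\n"
            "Rewrite the answer, fixing the issues raised above:"
        )
    return answer
```

Any text-completion function can be passed as `generate`; the loop simply alternates critique and revision, which is the general shape of the self-critique prompting the quoted survey describes.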
“…That said, there is a substantial body of research focused on the application of LLMs and AIVAs in specialized patient education tasks. These studies evaluate the feasibility, accuracy, and suitability of these technologies for responding to inquiries across various medical fields [9,15,20-23]. Some studies have enhanced chatbots using LLMs, successfully simulating the patient-physician dynamic.…”
Section: Introduction (mentioning)
confidence: 99%