Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

Liu, Fenglin; Li, Zheng; Zhou, Hongjian; Yin, Qingyu; Yang, Jingfeng; Tang, Xianfeng; Luo, Chen; Zeng, Ming; Jiang, Haoming; Gao, Yifan; Nigam, Priyanka; Nag, Sreyashi; Yin, Bing; Hua, Yining; Zhou, Xuan; Rohanian, Omid; Thakur, Anshul; Clifton, Lei; Clifton, David A.

doi:10.1101/2024.04.24.24306315

Cited by 2 publications

References 54 publications

Multi-step Transfer Learning in Natural Language Processing for the Health Domain

Manaka, Zyl, Kar et al. 2024

Neural Process Lett

Restricted access to data in healthcare facilities, a consequence of patient privacy and confidentiality policies, has slowed the application of general natural language processing (NLP) techniques in the health domain. Additionally, because clinical data is unique to individual institutions and laboratories, there are not enough standards and conventions for data annotation. In places without robust death registration systems, the cause of death (COD) is determined through a verbal autopsy (VA) report. A non-clinician field agent completes a VA report using a set of standardized questions as a guide to identify the symptoms of a COD. The narrative text of the VA report is used as a case study to examine the difficulties of applying NLP techniques to the healthcare domain. This paper presents a framework that leverages knowledge across multiple domains via two domain adaptation techniques: feature extraction and fine-tuning. These techniques aim to improve VA text representations for COD classification tasks in the health domain. The framework is motivated by multi-step learning, where a final learning task is realized via a sequence of intermediate learning tasks. The framework builds upon the strengths of the Bidirectional Encoder Representations from Transformers (BERT) and Embeddings from Language Models (ELMo) models pretrained on the general English and biomedical domains. These models are employed to extract features from the VA narratives. Our results demonstrate improved performance when initializing the learning of BERT embeddings with ELMo embeddings. The benefit of incorporating character-level information for learning word embeddings in the English domain, coupled with word-level information for learning word embeddings in the biomedical domain, is also evident.

Performance of Open-Source LLMs in Challenging Radiological Cases – A Benchmark Study on 1,933 Eurorad Case Reports

Kim, Schramm, Adams et al. 2024

Preprint

Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models offer advantages in continuity of access and potentially lower costs. In this study, we evaluated the diagnostic performance of eleven state-of-the-art open-source LLMs using clinical and imaging descriptions from 1,933 case reports in the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis was included in the top three LLM suggestions. Llama-3-70B evaluated LLM responses, with its accuracy validated against radiologist ratings in a case subset. Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. Llama-3-70B demonstrated superior performance, followed by Gemma-2-27B and Mixtral-8x-7B. Similar results were found in the non-public dataset, where Llama-3-70B, Gemma-2-27B, and Mixtral-8x-7B again emerged as the top models. Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging, real-world cases.
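
The correctness criterion in this abstract (the true diagnosis appears among the top three suggestions) is a top-k accuracy. The sketch below illustrates that metric only. The function name `top_k_accuracy` and the sample diagnoses are hypothetical, and the case-insensitive exact string match is a simplifying assumption: the study itself used Llama-3-70B to judge whether a response matched the true diagnosis.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_suggestions: Sequence[Sequence[str]],
    true_diagnoses: Sequence[str],
    k: int = 3,
) -> float:
    """Return the fraction of cases whose true diagnosis is among the first k suggestions.

    Matching is a case-insensitive exact string comparison, which is an
    assumption made for this sketch and not the judging method of the study.
    """
    if len(ranked_suggestions) != len(true_diagnoses):
        raise ValueError("each case needs exactly one true diagnosis")
    if not true_diagnoses:
        return 0.0

    def normalize(text: str) -> str:
        return text.strip().lower()

    hits = 0
    for suggestions, truth in zip(ranked_suggestions, true_diagnoses):
        # Only the first k ranked suggestions count toward a hit.
        top_k = {normalize(s) for s in suggestions[:k]}
        if normalize(truth) in top_k:
            hits += 1
    return hits / len(true_diagnoses)


# Hypothetical example: two cases, one hit within the top three.
suggestions = [
    ["glioblastoma", "metastasis", "abscess"],
    ["meningioma", "schwannoma", "lymphoma", "glioblastoma"],
]
truths = ["Metastasis", "glioblastoma"]
print(top_k_accuracy(suggestions, truths))  # 0.5
```

In the second case the true diagnosis sits in fourth place, so it does not count at k = 3.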
