Large Language Model (LLM)-Powered Chatbots Fail to Generate Guideline-Consistent Content on Resuscitation and May Provide Potentially Harmful Advice

Birkun, Alexei A.; Gautam, Adhish

doi:10.1017/s1049023x23006568

Cited by 9 publications

(14 citation statements)

References 25 publications

“…The authors were primarily affiliated with institutions in the United States (n=47 of 122 different countries identified per publication, 38.5%), followed by Germany (n=11/122, 9%), Turkey (n=7/122, 5.7%), the United Kingdom (n=6/122, 4.9%), China/Australia/Italy (n=5/122, 4.1%, respectively), and 24 (n=36/122, 29.5%) other countries. Most studies examined one or more applications based on the GPT-3.5 architecture (n=66 of 124 different LLMs examined per study, 53.2%) 13,26–29,31–34,36–40,42–49,52–54,56–61,63,65–67,71,72,74,75,77,78,81–89,91,92,94,95,97–100,102–104,106–109,111 , followed by GPT-4 (n=33/124, 26.6%) 13,25,27,29,30,34–36,41,43,50,51,54,55,58,61,64,68–70,74,76,79–81,83,87,89,90,93,96,98,99,101,105 , Bard (n=10/124, 8.1%; now known as Gemini) 33,48,49,55,73,74,80,87,94,99 , Bing Chat (n=7/124, 5.7%; now Microsoft Copilot) 49,51,55,73,94,99,110 , and other applications based on Bidirectional Encoder Representations from Transformers (BERT; n=4/124, 3...…”

Section: Results (mentioning)

confidence: 99%

“…Most studies examined one or more applications based on the GPT-3.5 architecture (n=66 of 124 different LLMs examined per study, 53.2%) 13,26–29,31–34,36–40,42–49,52–54,56–61,63,65–67,71,72,74,75,77,78,81–89,91,92,94,95,97–100,102–104,106–109,111 , followed by GPT-4 (n=33/124, 26.6%) 13,25,27,29,30,34–36,41,43,50,51,54,55,58,61,64,68–70,74,76,79–81,83,87,89,90,93,96,98,99,101,105 , Bard (n=10/124, 8.1%; now known as Gemini) 33,48,49,55,73,74,80,87,94,99 , Bing Chat (n=7/124, 5.7%; now Microsoft Copilot) 49,51,55,73,94,99,110 , and other applications based on Bidirectional Encoder Representations from Transformers (BERT; n=4/124, 3.2%) 13,83,84 , Large Language Model Meta-AI (LLaMA; n=3/124, 2.4%) 55 , or Claude by Anthropic (n=1/124, 0.8%) 55 . The majority of applications were p...…”

Section: Results (mentioning)

confidence: 99%

“…In addition, data-related limitations were identified, including limited access to data on the internet (n=22/89, 24.7%) 38,39,41,43,54–57,59,60,64,76,79,82–84,88,91,94,96,104,109 , the undisclosed origin of training data (n=36/89, 40.5%) 25,26,29,30,32,34,36,37,40,46,47,50,51,53–60,64,65,70,71,76,82,83,91,94–96,101,105,109 , limitations in providing, evaluating, and validating references (n=20/89, 22.5%) 45,49,54–57,65,71,73,76,80,83,85,91,94,96,98,101,103,105 , and storage/processing of sensitive health information (n=8/89, 9%) 13,34,46,55,62,76,83,109 . Further second-order concepts included black-box algorithms, i.e., non-explainable AI (n=12/89, 13.5%) 27,36,55,57,65,73,76,83,91,94,103,105 , limited engagement and dialogue capabilities (n=10/89) 13,27,28,37,…”

Section: Results (mentioning)

confidence: 99%

“…The evaluation of limitations in output data yielded 7 second-order codes concerning the non-reproducibility (n=38/89, 42.7%) 28,29,34,38,39,41,43,45,46,49,54–61,64,65,71–73,76,80,82,83,85,90,91,94,96,98,99,101,103–105 , non-comprehensiveness (n=78/89, 87.6%) 13,25,26,28–30,32–44,46,48–62,64,65,67–79,81–98,100,102–107,109–111 , incorrectness (n=78/89, 87.6%) 13,25–44,46,49–52,54–62,64–66,69–79,81–85,87–107,109–111 , (un-)safeness (n=39/89, 43.8%) 28,30,35,37,39,40,42–44,46,50,51,57–60,62,64,65,69,70,73,74,76,78–80,82,84,85,91…”

Section: Results (mentioning)

confidence: 99%

“…For non-reproducibility, key concepts included the non-deterministic nature of the output, e.g., due to inconsistent results across multiple iterations (n=34/89, 38.2%) 28,29,34,38,39,41,43,46,58–61,72,76,82,90,94,98,99,101,103,104 and the inability to provide reliable references (n=20/89, 22.5%) 45,49,54–57,65,71,73,76,80,83,85,91,94,96,98,101,103,105 . Non-comprehensiveness included nine concepts related to generic/non-personalized output (n=34/89, 38.2%) 13,28,30,34,37,38,41,43,49,51,56,57,59,61,65,70,77,79,81,84–86,90,94,95,100,102–107,110 , incompleteness of output (n=68/89, 76.4%) 13,25,26,28–30,32,34–39,41–44,46,49–52,55–62,64,65,67–69,72–77,79,81–86,89–98,…”

Section: Results (mentioning)

confidence: 99%

Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

Busch, Hoffmann, Rueger et al. 2024. Preprint.

The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care using a data-driven convergent synthesis approach. We searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4,349 initial records, 89 studies across 29 medical specialties were included, primarily examining models based on the GPT-3.5 (53.2%, n=66 of 124 different LLMs examined per study) and GPT-4 (26.6%, n=33/124) architectures in medical question answering, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations included 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations included 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. In conclusion, this study is the first review to systematically map LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.

The Breakthrough of Large Language Models Release for Medical Applications: 1-Year Timeline and Perspectives

Cascella, Semeraro, Montomoli et al. 2024. J Med Syst.

Within the domain of Natural Language Processing (NLP), Large Language Models (LLMs) represent sophisticated models engineered to comprehend, generate, and manipulate text resembling human language on an extensive scale. They are transformer-based deep learning architectures, obtained through the scaling of model size, pretraining of corpora, and computational resources. The potential healthcare applications of these models primarily involve chatbots and interaction systems for clinical documentation management, and medical literature summarization (Biomedical NLP). The challenge in this field lies in the research for applications in diagnostic and clinical decision support, as well as patient triage. Therefore, LLMs can be used for multiple tasks within patient care, research, and education. Throughout 2023, there has been an escalation in the release of LLMs, some of which are applicable in the healthcare domain. This remarkable output is largely the effect of the customization of pre-trained models for applications like chatbots, virtual assistants, or any system requiring human-like conversational engagement. As healthcare professionals, we recognize the imperative to stay at the forefront of knowledge. However, keeping abreast of the rapid evolution of this technology is practically unattainable, and, above all, understanding its potential applications and limitations remains a subject of ongoing debate. Consequently, this article aims to provide a succinct overview of the recently released LLMs, emphasizing their potential use in the field of medicine. Perspectives for a more extensive range of safe and effective applications are also discussed. The upcoming evolutionary leap involves the transition from an AI-powered model primarily designed for answering medical questions to a more versatile and practical tool for healthcare providers such as generalist biomedical AI systems for multimodal-based calibrated decision-making processes. On the other hand, the development of more accurate virtual clinical partners could enhance patient engagement, offering personalized support, and improving chronic disease management.

Towards the development of a conceptual framework for improving the quality of public information on cardiopulmonary resuscitation

Birkun 2024. The American Journal of Emergency Medicine.
