2023
DOI: 10.1101/2023.04.22.23288967
Preprint

Evaluating Large Language Models on Medical Evidence Summarization

Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study h…


Cited by 12 publications (4 citation statements)
References 16 publications (26 reference statements)
“…[34] Further studies have shed light on the reliability of content generated by Chatbots when summarising medical literature reviews, highlighting the concerning possibility of Chatbots producing unfounded statements or even fabricated information. [35] Similar instances have been observed outside of healthcare, as exemplified by a recent court case where a legal representative resorted to the use of fabricated citations from ChatGPT3 in support of their argument.…”
Section: Discussion (mentioning)
confidence: 63%
“…Another study on LLM outputs for medical evidence summarization tasks also employed both automatic and human evaluation. They defined summary quality based on coherence, factual consistency, comprehensiveness, and harmfulness (Tang et al, 2023). The researchers concluded that automatic metrics often do not strongly correlate with the quality of summaries.…”
Section: Methods For Evaluating the Performance Of LLMs In Clinical T... (mentioning)
confidence: 99%
“…Automatic text summarization is a sub-area of text mining in which a system determines the most informative information in the original text to produce a summary for certain tasks and users. To generate a summary, researchers have developed summarization systems for different purposes (i.e., single document summarization (Liu et al, 2019b), multi-document summarization (Chen et al, 2023), aspect-based opinion summarization (Wu et al, 2016), query-focused summarization (Baumel et al, 2016), update summarization (Delort, & Alfonseca, 2012), and cross-language document summarization (Wan, 2011)) to summarize different text genres such as product reviews (Yu et al, 2016), news articles (Huang et al, 2011), political text (Sharevski et al, 2021), meeting text (Oya et al, 2014), scientific articles (Altmami, & Menai, 2022), online debates (Sanchan et al, 2017; Sanchan, Bontcheva, & Aker, 2020), and medical data (Abacha et al, 2021; Tang et al, 2023), with the assistance of Artificial Intelligence, i.e., ChatGPT, in their summarization task. Later, the generated summaries will be assessed against various criteria such as informativeness, text coherence, readability, and understandability.…”
Section: Introduction (mentioning)
confidence: 99%