2023
DOI: 10.1101/2023.04.22.23288967
Preprint

Evaluating Large Language Models on Medical Evidence Summarization

Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study h…


Cited by 12 publications (4 citation statements)
References 16 publications (26 reference statements)
“…[34] Further studies have shed light on the reliability of content generated by Chatbots when summarising medical literature reviews, highlighting the concerning possibility of Chatbots producing unfounded statements or even fabricated information. [35] Similar instances have been observed outside of healthcare, as exemplified by a recent court case where a legal representative resorted to the use of fabricated citations from ChatGPT3 in support of their argument.…”
Section: Discussion (mentioning)
confidence: 63%
“…Another study on LLM outputs for medical evidence summarization tasks also employed both automatic and human evaluation. They defined summary quality based on coherence, factual consistency, comprehensiveness, and harmfulness (Tang et al, 2023). The researchers concluded that automatic metrics often do not strongly correlate with the quality of summaries.…”
Section: Methods For Evaluating the Performance Of LLMs In Clinical T... (mentioning)
confidence: 99%
“…Automatic text summarization is a sub-area of text mining in which a system determines the most informative information in the original text to produce a summary for certain tasks and users. To generate a summary, researchers have developed summarization systems for different purposes (i.e., single document summarization (Liu et al, 2019b), multi-document summarization (Chen et al, 2023), aspect-based opinion summarization (Wu et al, 2016), query-focused summarization (Baumel et al, 2016), update summarization (Delort, & Alfonseca, 2012), and cross-language document summarization (Wan, 2011)) to summarize different text genres such as product reviews (Yu et al, 2016), news articles (Huang et al, 2011), political text (Sharevski et al, 2021), meeting text (Oya et al, 2014), scientific articles (Altmami, & Menai, 2022), online debates (Sanchan et al, 2017; Sanchan, Bontcheva, & Aker, 2020), and medical data (Abacha et al, 2021; Tang et al, 2023), with the assistance of Artificial Intelligence, i.e., ChatGPT, in their summarization task. Later, the generated summaries will be assessed against various criteria such as informativeness, text coherence, readability, and understandability.…”
Section: Introduction (mentioning)
confidence: 99%