Improving Radiology Summarization with Radiograph and Anatomy Prompts

Hu, Jinpeng; Chen, Zhihong; Liu, Yang; Xiang, Wei; Chang, Tsung-Hui

doi:10.48550/arxiv.2210.08303

Cited by 1 publication

(1 citation statement)

References 32 publications

(54 reference statements)

“…Table 5 show that, among all LLMs, GPT-4 (OpenAI, 2023c) consistently achieves the best results on all generation tasks, showcasing its exceptional capability in capturing and summarizing important clinical findings compared to other LLMs. Nonetheless, the task-specific SOTA model (Hu et al, 2022) achieves 46.1 and 67.9 ROUGE-L scores on MIMIC-CXR and IU-Xray, respectively, significantly higher than all LLMs.…”

Section: Results (mentioning)

confidence: 93%

Large Language Models in the Clinic: A Comprehensive Benchmark

Liu, Zhou, Hua et al. 2024

Preprint

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering task with answer options for evaluation. However, in real clinical settings, many clinical decisions, such as treatment recommendations, involve answering open-ended questions without pre-set options. Meanwhile, existing studies mainly use accuracy to assess model performance. In this paper, we comprehensively benchmark diverse LLMs in healthcare, to clearly understand their strengths and weaknesses. Our benchmark contains seven tasks and thirteen datasets across medical language generation, understanding, and reasoning. We conduct a detailed evaluation of existing sixteen LLMs in healthcare under both zero-shot and few-shot (i.e., 1, 3, 5-shot) learning settings. We report the results on five metrics (i.e., matching, faithfulness, comprehensiveness, generalizability, and robustness) that are critical in achieving trust from clinical users. We further invite medical experts to conduct human evaluation.
