Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study

Goh, Ethan; Gallo, Robert; Hom, Jason; Strong, Eric; Weng, Yingjie; Kerman, Hannah; Cool, Josephine; Kanjee, Zahir; Parsons, Andrew S.; Ahuja, Neera; Horvitz, Eric; Yang, Daniel; Milstein, Arnold; Olson, Andrew P.J.; Rodman, Adam; Chen, Jonathan H.

doi:10.1101/2024.03.12.24303785

Cited by 4 publications

(3 citation statements)

References 37 publications

“…However, regardless of the specific performance metrics of any LLM-based tool 16 , correct tool usage remains crucial. Effective prompting strategies 17,18 and appropriate application by users are essential for optimizing the performance of these tools 19,20 .…”

Section: Discussion (mentioning)

confidence: 99%

Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience

Alvarez-Estape, Cano, Pino et al. 2024. Preprint

Importance: The time to accurately diagnose rare pediatric diseases often spans years. Assessing the diagnostic accuracy of an LLM-based tool on real pediatric cases can help reduce this time, providing quicker diagnoses for patients and their families.

Objective: To evaluate the clinical utility of DxGPT as a support tool for differential diagnosis of both common and rare diseases.

Design: Unicentric descriptive cross-sectional exploratory study. Anonymized data from 50 pediatric patients' medical histories, covering common and rare pathologies, were used to generate clinical case notes. Each clinical case included essential data, with some expanded by complementary tests.

Setting: This study was conducted at a reference pediatric hospital, Sant Joan de Déu Barcelona Children's Hospital.

Participants: A total of 50 clinical cases were diagnosed by 78 volunteer doctors (medical diagnostic team) with varying experience, each reviewing 3 clinical cases.

Interventions: Each clinician listed up to five diagnoses per clinical case note. The same was done on the DxGPT web platform, obtaining the Top-5 diagnostic proposals. To evaluate DxGPT's variability, each note was queried three times.

Main Outcome(s) and Measure(s): The study mainly focused on comparing diagnostic accuracy, defined as the percentage of cases with the correct diagnosis, between the medical diagnostic team and DxGPT. Other evaluation criteria included qualitative assessments. The medical diagnostic team also completed a survey on their user experience with DxGPT.

Results: Top-5 diagnostic accuracy was 65% for clinicians and 60% for DxGPT, with no significant differences. Accuracies for common diseases were higher (Clinicians: 79%, DxGPT: 71%) than for rare diseases (Clinicians: 50%, DxGPT: 49%). Accuracy increased similarly in both groups with expanded information, but this increase was only statistically significant in clinicians (simple 52% vs. expanded 69%; p=0.03). DxGPT's response variability affected less than 5% of clinical case notes. A survey of 48 clinicians rated the DxGPT platform 3.9/5 overall, 4.1/5 for usefulness, and 4.5/5 for usability.

Conclusions and Relevance: DxGPT showed diagnostic accuracies similar to medical staff from a pediatric hospital, indicating its potential for supporting differential diagnosis in other settings. Clinicians praised its usability and simplicity. These tools could provide new insights for challenging diagnostic cases.

“…There is also significant practical interest in examining whether ChatGPT exhibits a more pronounced beneficial effect on diagnostic accuracy and the quantity of differential diagnoses considered, potentially attributable to its heightened computational capabilities. 12 Additionally, we seek to assess whether brief instructional training emphasising the importance of expanding the hypothesis space augments these effects. To achieve this, our primary focus is on modelling the dependent variables diagnostic accuracy and number of generated differential diagnoses using linear mixed-effects models 54 in R. 55…”

Section: Methods and Analysis (mentioning)

confidence: 99%

“… 19 23 It is, therefore, imperative to comprehensively explore the extent, application and constraints of LLMs in clinical decision support to guarantee their conscientious and efficient implementation in practice. 12 18 24 25 To address these concerns, this prospective, randomised controlled clinical vignette study examines the influence of decision support using an LLM (ChatGPT) on the diagnostic process and outcomes compared with that of a human coach. This will advance the understanding of how human–AI collaboration can be leveraged to enhance diagnostic decision-making.…”

Section: Introduction (mentioning)

confidence: 99%

Effects of interacting with a large language model compared with a human coach on the clinical diagnostic process and outcomes among fourth-year medical students: study protocol for a prospective, randomised experiment using patient vignettes

Kämmer, Hautz, Krummrey et al. 2024. BMJ Open

Introduction: Versatile large language models (LLMs) have the potential to augment diagnostic decision-making by assisting diagnosticians, thanks to their ability to engage in open-ended, natural conversations and their comprehensive knowledge access. Yet the novelty of LLMs in diagnostic decision-making introduces uncertainties regarding their impact. Clinicians unfamiliar with the use of LLMs in their professional context may rely on general attitudes towards LLMs more broadly, potentially hindering thoughtful use and critical evaluation of their input, leading to either over-reliance and lack of critical thinking or an unwillingness to use LLMs as diagnostic aids. To address these concerns, this study examines the influence on the diagnostic process and outcomes of interacting with an LLM compared with a human coach, and of prior training vs no training for interacting with either of these ‘coaches’. Our findings aim to illuminate the potential benefits and risks of employing artificial intelligence (AI) in diagnostic decision-making.

Methods and analysis: We are conducting a prospective, randomised experiment with N=158 fourth-year medical students from Charité Medical School, Berlin, Germany. Participants are asked to diagnose patient vignettes after being assigned to either a human coach or ChatGPT and after either training or no training (both between-subject factors). We are specifically collecting data on the effects of using either of these ‘coaches’ and of additional training on information search, number of hypotheses entertained, diagnostic accuracy and confidence. Statistical methods will include linear mixed effects models. Exploratory analyses of the interaction patterns and attitudes towards AI will also generate more generalisable knowledge about the role of AI in medicine.

Ethics and dissemination: The Bern Cantonal Ethics Committee considered the study exempt from full ethical review (BASEC No: Req-2023-01396). All methods will be conducted in accordance with relevant guidelines and regulations. Participation is voluntary and informed consent will be obtained. Results will be published in peer-reviewed scientific medical journals. Authorship will be determined according to the International Committee of Medical Journal Editors guidelines.

Natural Language Processing in medicine and ophthalmology: A review for the 21st-century clinician

Rojas-Carabali, Agrawal, Gutierrez-Sinisterra et al. 2024. Asia-Pacific Journal of Ophthalmology
