2023
DOI: 10.1101/2023.09.12.23295399
Preprint

Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning

Shreya Johri,
Jaehwan Jeong,
Benjamin A. Tran
et al.

Abstract: Large Language Models (LLMs) show promise for medical diagnosis, but traditional evaluations using static exam questions overlook the complexity of real-world clinical dialogues. We introduce a multi-agent conversational framework where doctor-AI and patient-AI agents interact to diagnose medical conditions, evaluated by a grader-AI agent and medical experts. We assessed the diagnostic accuracy of GPT-4 and GPT-3.5 in conversational versus static settings using 140 cases focusing on skin diseases. Our study r…
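The abstract describes doctor-AI, patient-AI, and grader-AI agents interacting over a case; the sketch below illustrates one way such a conversational evaluation loop can be wired up. It is a minimal illustration under stated assumptions only: the `ask_llm` callable, the `Case` record, the prompts, and the five-turn cap are hypothetical names invented for the example and are not taken from the paper.

```python
# Hypothetical sketch of a doctor/patient/grader agent loop.
# `ask_llm` stands in for any text-in, text-out LLM call; it is an assumption,
# not an API from the paper.
from dataclasses import dataclass
from typing import Callable

MAX_TURNS = 5  # assumed cap on doctor questions before a diagnosis is required


@dataclass
class Case:
    vignette: str        # full case description, visible only to the patient agent
    true_diagnosis: str  # ground-truth label used by the grader agent


def run_case(case: Case, ask_llm: Callable[[str], str]) -> bool:
    """Simulate one conversational encounter; return True if the grader agent
    judges the doctor agent's final diagnosis to match the ground truth."""
    transcript: list[str] = []
    for _ in range(MAX_TURNS):
        # The doctor agent sees only the dialogue so far, never the vignette.
        doctor_msg = ask_llm(
            "You are a doctor. Ask one question or state 'FINAL DIAGNOSIS: ...'.\n"
            + "\n".join(transcript)
        )
        transcript.append(f"Doctor: {doctor_msg}")
        if "FINAL DIAGNOSIS" in doctor_msg:
            break
        # The patient agent answers from the hidden vignette.
        patient_msg = ask_llm(
            f"You are the patient described here: {case.vignette}\n"
            f"Answer the doctor's question briefly.\n{doctor_msg}"
        )
        transcript.append(f"Patient: {patient_msg}")
    # The grader agent compares the stated diagnosis with the ground truth.
    verdict = ask_llm(
        f"Reference diagnosis: {case.true_diagnosis}\n"
        "Conversation:\n" + "\n".join(transcript)
        + "\nDoes the doctor's final diagnosis match? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```

Keeping the vignette visible only to the patient agent is what separates this setup from static exam questions: the doctor agent must elicit the history through dialogue before committing to a diagnosis, which the grader then scores against the reference label.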

Cited by 3 publications (7 citation statements)
References 36 publications
“…The authors were primarily affiliated with institutions in the United States (n=47 of 122 different countries identified per publication, 38.5%), followed by Germany (n=11/122, 9%), Turkey (n=7/122, 5.7%), the United Kingdom (n=6/122, 4.9%), China/Australia/Italy (n=5/122, 4.1%, respectively), and 24 (n=36/122, 29.5%) other countries. Most studies examined one or more applications based on the GPT-3.5 architecture (n=66 of 124 different LLMs examined per study, 53.2%) 13,2629,3134,3640,4249,5254,5661,63,6567,71,72,74,75,77,78,8189,91,92,94,95,97100,102–104,106109,111 , followed by GPT-4 (n=33/124, 26.6%) 13,25,27,29,30,3436,41,43,50,51,54,55,58,61,64,6870,74,76,7981,83,87,89,90,93,96,98,99,101,105 , Bard (n=10/124, 8.1%; now known as Gemini) 33,48,49,55,73,74,80,87,94,99 , Bing Chat (n=7/124, 5.7%; now Microsoft Copilot) 49,51,55,73,94,99,110 , and other applications based on Bidirectional Encoder Representations from Transformers (BERT; n=4/124, 3...…”
Section: Results (mentioning, confidence: 99%)
“…In addition, data-related limitations were identified, including limited access to data on the internet (n=22/89, 24.7%) 38,39,41,43,5457,59,60,64,76,79,8284,88,91,94,96,104,109 , the undisclosed origin of training data (n=36/89, 40.5%) 25,26,29,30,32,34,36,37,40,46,47,50,51,5360,64,65,70,71,76,82,83,91,9496,101,105,109 , limitations in providing, evaluating, and validating references (n=20/89, 22.5%) 45,49,5457,65,71,73,76,80,83,85,91,94,96,98,101,103,105 , and storage/processing of sensitive health information (n=8/89, 9%) 13,34,46,55,62,76,83,109 . Further second-order concepts included black-box algorithms, i.e., non-explainable AI (n=12/89, 13.5%) 27,36,55,57,65,73,76,83,91,94,103,105 , limited engagement and dialogue capabilities (n=10/89) 13,27,28,37,…”
Section: Results (mentioning, confidence: 99%)