2023
DOI: 10.1101/2023.01.23.23284735
Preprint

Putting ChatGPT’s Medical Advice to the (Turing) Test

Abstract: Importance: Chatbots could play a role in answering patient questions, but patients’ ability to distinguish between provider and chatbot responses, and their trust in chatbot functions, are not well established. Objective: To assess the feasibility of using ChatGPT or a similar AI-based chatbot for patient-provider communication. Design: Survey conducted in January 2023. Setting: Survey. Participants: A US-representative sample of 400 participants aged 18 and above, recruited on Prolific, a crowdsourcing platform…

Cited by 46 publications (39 citation statements)
References 11 publications (15 reference statements)
“…However, the accuracy of computational models’ answers to medical questions has yet to exceed that of fully trained physicians, with findings in the present context of primary care being no exception [16, 17]. When ChatGPT is used as a medical advice chatbot, advice seekers are able to identify the source of the provided advice as computational only 65% of the time [19]. It follows that health care providers must protect their patients from inaccurate information produced by this technology, as patients are unable to reliably differentiate between computational and human advice [19].…”
Section: Discussion (mentioning)
confidence: 99%
“…When ChatGPT is used as a medical advice chatbot, advice seekers are able to identify the source of the provided advice as computational only 65% of the time [19]. It follows that health care providers must protect their patients from inaccurate information produced by this technology, as patients are unable to reliably differentiate between computational and human advice [19]. This requirement for oversight limits the potential of LLMs to meaningfully change practice, as performance equivalent to that of experts is the minimum standard to justify autonomous deployment: there must be confidence in the accuracy and trustworthiness of answers from these applications [20, 21].…”
Section: Discussion (mentioning)
confidence: 99%
“…We identified only one study with a similar design to ours [18]. The authors extracted patient-provider communication from electronic health records and presented five cases with provider-written answers and another five with ChatGPT-generated answers. Participants, who were recruited online, correctly identified 65% of AI-generated answers, a magnitude similar to our findings, although the authors neither formally tested any hypotheses nor presented precision estimates for their results.…”
Section: Discussion (mentioning)
confidence: 99%
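The precision estimate that the citing authors note is missing can be sketched for illustration. A minimal example in Python, assuming the roughly 400 respondents reported in the abstract and treating the 65% correct-identification rate as 260 of 400 independent judgments (both figures are assumptions for the sketch; the excerpt does not state the exact denominator), computes a 95% Wilson score interval for the proportion:

import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Assumed figures for illustration: 260/400 correct (65%); the true
# number of judgments per respondent is not stated in the excerpt.
low, high = wilson_ci(successes=260, n=400)
print(f"65% correct, n=400 -> 95% CI: {low:.3f} to {high:.3f}")

Under these assumptions the interval is roughly 0.60 to 0.70: identification is clearly better than the 50% expected from guessing in a balanced two-way classification, but far from reliable, which is consistent with the discussion above.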
“…Some preliminary work in the medical domain highlighted ChatGPT’s ability to write realistic scientific abstracts [3], pass medical licensing exams [4], and accurately determine appropriate radiology studies [5]. Although ChatGPT can triage medical cases [6], answer clinical questions consistent with the judgment of practicing physicians [7], and provide medical advice that is perceived as human-like by non-clinicians [8], its ability to provide appropriate and equitable advice to patients across a range of clinical contexts remains unknown. These knowledge gaps are important because the underlying training data and approach for ChatGPT have not been released [9], and there are substantive concerns about the safety, fairness, and regulation of LLMs and clinical AI systems.…”
Section: Main (mentioning)
confidence: 99%