Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis

He, Wenjie; Zhang, Wenyan; Jin, Ya; Zhou, Qiang; Zhang, Huadan; Xia, Qing

doi:10.2196/54706

J Med Internet Res

2024


Abstract: Background: There is a dearth of feasibility assessments regarding the use of large language models (LLMs) for responding to inquiries from autistic patients within a Chinese-language context. Although Chinese is one of the most widely spoken languages globally, research on applying these models in the medical field has focused predominantly on English-speaking populations. Objective: This study aims to assess the effectiveness of LLM chatbots, specific…


Cited by 2 publications

References 59 publications


Performance of Artificial Intelligence Chatbots on Ultrasound Exam: Cross-Sectional Comparative Analysis (Preprint)

Zhang, Lu, Luo et al. 2024

Preprint

Background: Artificial intelligence chatbots, including those used in the field of ultrasound, are increasingly used to answer medical questions. However, many chatbot models exist, their performance varies, and it is affected by multiple factors such as language environment, question type, and topic.

Objective: This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering questions related to ultrasound medical examinations, providing a reference for users and developers.

Methods: We collected actual examination papers from the field of ultrasound medicine and strictly selected 554 questions, including single-choice, multiple-choice, and true-or-false questions, noun explanations, and short answers. The topics included basic knowledge, ultrasound examination, diagnosis, diseases and etiology, case analysis, and ultrasound signs. The questions were asked in both English and Chinese. Objective questions were evaluated by correct response rate. Subjective questions were rated on a Likert scale by a doctor with more than 20 years of work experience and proficiency in both Chinese and English. The data were imported into Excel for comparative analysis.

Results: Of the 554 questions, single-choice questions accounted for the greatest proportion (64%), followed by short answers (12%) and noun explanations (11%); the remaining questions were multiple-choice and true-or-false questions. Accuracy on the objective questions ranked in the following order: true-or-false questions (60%-80%), single-choice questions (57.34%-62.99%), and multiple-choice questions (8.33%-39.58%). Among the subjective questions, the acceptability rate of short answers (65.22%-75.36%) was slightly greater than that of noun explanations (47.62%-61.9%). In the comparison between models, ERNIE Bot performed slightly better than ChatGPT in several aspects. Both models showed a decline in performance when the examination questions were translated into English, but the decline was less pronounced for ERNIE Bot. By topic, the models performed better on basic knowledge, ultrasound examination methods, and diseases and etiology than on ultrasound signs and ultrasound diagnosis.

Conclusions: In this cross-sectional study, chatbots provided valuable answers to ultrasound examination questions, but performance differed between models and was closely related to input language, question type, and topic. Overall, the answers of ERNIE Bot were superior to those of ChatGPT in many aspects. Users and developers need a deep understanding of the performance characteristics of the models and should select different models for different questions and language environments, both to make full use of chatbots and to continuously optimize and improve their performance.

Clinical Trial: None
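The question-type shares in the results above leave the combined share of multiple-choice and true-or-false questions implicit. The following is a minimal sketch of that arithmetic, assuming only the rounded percentages and the total of 554 given in the abstract. The function names are hypothetical, and the per-type counts are approximations derived from rounded shares, not figures reported by the authors.

```python
TOTAL_QUESTIONS = 554  # total reported in the abstract

# Rounded shares reported in the abstract for three of the five question types.
reported_shares = {
    "single-choice": 0.64,
    "short answer": 0.12,
    "noun explanation": 0.11,
}


def remaining_share(shares):
    """Share left over for the question types the abstract does not itemize."""
    return round(1.0 - sum(shares.values()), 2)


def approximate_counts(shares, total):
    """Approximate per-type counts; inexact because the shares are rounded."""
    return {name: round(share * total) for name, share in shares.items()}


if __name__ == "__main__":
    print("multiple-choice + true-or-false share:", remaining_share(reported_shares))
    print("approximate counts:", approximate_counts(reported_shares, TOTAL_QUESTIONS))
```

Under these assumptions, multiple-choice and true-or-false questions together make up roughly 13% of the set.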


Performance of ChatGPT-4o in Real-Time Medical Consultation for Retroperitoneal Fibrosis Patients Under Doctor Supervision: A Cross-Sectional Study in a Chinese Clinical Setting (Preprint)

Gao, Zhang, Liu et al. 2024

Preprint

Background: LLMs like GPT-4 show promise in medical consultations but face challenges in non-English or real-time contexts. The new GPT-4o, with improved text processing and faster responses, may better address rare diseases like retroperitoneal fibrosis (RPF).

Objective: The performance of GPT-4o in providing real-time medical consultations for patients with rare diseases, which are generally a challenge in clinical practice, remains underexplored. We evaluated the competency of GPT-4o in generating responses about rare autoimmune RPF in terms of accuracy, completeness, readability, and quality, using a 7-point Likert scale.

Methods: A total of 103 real-world RPF patient queries were collected from diverse sources. Responses were generated using the newly released version of GPT-4o (2024/5/17). All questions were also stratified and randomly divided into six groups. Each of six attending rheumatologists was assigned to answer one set of questions and then generated new responses with the assistance of GPT-4o. All responses were assessed blindly by three experts in RPF.

Results: GPT-4o scored significantly higher than rheumatologists in accuracy (6.39 ± 0.50 vs. 4.99 ± 0.62), completeness (6.51 ± 0.44 vs. 4.55 ± 0.60), readability (6.45 ± 0.42 vs. 4.93 ± 0.59), and quality (6.42 ± 0.46 vs. 4.78 ± 0.55) (p < 0.001). The competency of rheumatologists + GPT-4o was better than that of rheumatologists alone (accuracy: 6.13 ± 0.63, completeness: 5.99 ± 0.81, readability: 6.05 ± 0.67, quality: 6.01 ± 0.71; p < 0.001), and physician revisions generally reduced the competency of GPT-4o. Subgroup analysis showed no significant difference in accuracy between GPT-4o and rheumatologists + GPT-4o in answering complex questions, but any type of revision lowered the competency of GPT-4o.

Conclusions: GPT-4o has the potential to provide real-time medical consultations for RPF in the Chinese clinical environment.
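The methods above mention that the 103 questions were stratified and randomly divided into six groups, without naming the stratification variable or the procedure. One common way to do this is sketched below, purely as an illustration: shuffle within each stratum, then deal items round-robin into the groups. The function `stratified_groups`, the `stratum_of` callback, and the "simple"/"complex" labels in the usage lines are all assumptions, not details from the preprint.

```python
import random
from collections import defaultdict


def stratified_groups(items, stratum_of, n_groups=6, seed=0):
    """Split items into n_groups so that each stratum is spread evenly.

    stratum_of maps an item to its stratum label (hypothetical here, since
    the abstract does not say which variable was used for stratification).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Bucket the items by stratum.
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[stratum_of(item)].append(item)

    groups = [[] for _ in range(n_groups)]
    next_slot = 0
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)  # random order within the stratum
        for item in members:
            # Deal round-robin; carrying the slot across strata keeps
            # total group sizes within one item of each other.
            groups[next_slot % n_groups].append(item)
            next_slot += 1
    return groups


if __name__ == "__main__":
    # Hypothetical data: 103 query ids labelled with a made-up complexity stratum.
    queries = [(i, "complex" if i % 3 == 0 else "simple") for i in range(103)]
    for index, group in enumerate(stratified_groups(queries, lambda q: q[1])):
        print(index, len(group))
```

Dealing round-robin after a within-stratum shuffle keeps both the group sizes and the per-stratum counts balanced to within one item, which is the property a stratified split is meant to provide.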
