2023
DOI: 10.1007/s00296-023-05473-5
Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use

Belkis Nihan Coskun,
Burcu Yagiz,
Gokhan Ocakoglu
et al.
Cited by 19 publications (6 citation statements)
References 26 publications
“…This study found that ChatGPT provided more detailed and accurate responses to patient questions about ROP, with 98% of answers falling into the “agreed” or “strongly agreed” category compared to BingAI and Gemini. A similar result was found by Coskun et al [ 16 ] in questions about methotrexate use, as ChatGPT achieved a 100% correct answer rate, while Bard (currently known as Gemini) and BingAI scored 73.91%. In another study assessing the quality and readability of AI chatbot-generated answers to frequently asked clinical inquiries in the field of bariatric and metabolic surgery, a significant difference was observed in the proportion of appropriate answers among the three LLMs: ChatGPT-4 led with 85.7%, followed by Bard at 74.3%, and BingAI at 25.7% [ 26 ].…”
Section: Discussion (supporting)
confidence: 86%
“…Each of these models—ChatGPT-4 with its broad conversational capabilities, BingAI with its research-centric prowess, and Gemini with its real-time information synthesis—reflects the strategic priorities of their respective developers and offers distinct advantages depending on the application. Therefore, each may behave differently in response to patient inquiries about medical conditions [ 16 , 17 ]. Similar studies in the ophthalmology literature also report varying results regarding the success of these LLMs in providing professional medical information or responding to patient inquiries [ 8 , 9 , 10 , 11 , 12 ].…”
Section: Introduction (mentioning)
confidence: 99%
“…We also reviewed a study assessing the accuracy and completeness of several LLMs when answering methotrexate-related questions [23]. This study was excluded because it focused solely on the pharmacological treatment of rheumatic disease. For a detailed breakdown of the inclusion and exclusion process at each stage, please refer to the PRISMA flowchart in Figure 1.…”
Section: Screening Results (mentioning)
confidence: 99%
“…The investigations have yielded various results about the superiority and performance of the three AI-based chatbots. In a study on methotrexate use, ChatGPT-3.5, Bard, and Bing had correct answer rates of 100%, 73.91%, and 73.91%, respectively [33]. A comparative study in endodontics also found that ChatGPT-3.5 offered more reliable information than Bard and Bing [34].…”
Section: Discussion (mentioning)
confidence: 99%