Evaluation High-Quality of Information from ChatGPT (Artificial Intelligence—Large Language Model) Artificial Intelligence on Shoulder Stabilization Surgery

Hurley, Eoghan T.; Crook, Bryan S.; Lorentz, Samuel G.; Danilkowicz, Richard M.; Lau, Brian C.; Taylor, Dean C.; Dickens, Jonathan F.; Anakwenze, Oke; Klifto, Christopher S.

doi:10.1016/j.arthro.2023.07.048

Cited by 30 publications

(21 citation statements)

References 22 publications


“…Low-medium quality DISCERN and JAMA scores and difficult readability have been reported in studies on different subjects, similar to our study. [35,38–41] However, in our study, among the artificial intelligence chat robots, perplexity especially attracts attention with its high DISCERN and JAMA scores. The fact that the answers of this artificial intelligence chatbot contain references has caused its DISCERN and JAMA scores to be high.…”

Section: Discussion (mentioning)

confidence: 79%

How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and perplexity responses

Gül, Erdemir, Hanci et al. 2024. Medicine


Subdural hematoma is defined as blood collection in the subdural space between the dura mater and arachnoid. Subdural hematoma is a condition that neurosurgeons frequently encounter and has acute, subacute and chronic forms. The incidence in adults is reported to be 1.72–20.60/100,000 people annually. Our study aimed to evaluate the quality, reliability and readability of the answers to questions asked to ChatGPT, Bard, and Perplexity about “Subdural Hematoma.” In this observational and cross-sectional study, we asked ChatGPT, Bard, and Perplexity to provide the 100 most frequently asked questions about “Subdural Hematoma” separately. Responses from all 3 chatbots were analyzed separately for readability, quality, reliability and adequacy. When the median readability scores of ChatGPT, Bard, and Perplexity answers were compared with the sixth-grade reading level, a statistically significant difference was observed in all formulas (P < .001). All 3 chatbot responses were found to be difficult to read. Bard responses were more readable than ChatGPT’s (P < .001) and Perplexity’s (P < .001) responses for all scores evaluated. Although there were differences between the results of the evaluated calculators, Perplexity’s answers were determined to be more readable than ChatGPT’s answers (P < .05). Bard answers were determined to have the best GQS scores (P < .001). Perplexity responses had the best Journal of the American Medical Association and modified DISCERN scores (P < .001). ChatGPT, Bard, and Perplexity’s current capabilities are inadequate in terms of quality and readability of “Subdural Hematoma” related text content. The readability standard for patient education materials as determined by the American Medical Association, National Institutes of Health, and the United States Department of Health and Human Services is at or below grade 6. The readability levels of the responses of artificial intelligence applications such as ChatGPT, Bard, and Perplexity are significantly higher than the recommended 6th grade level.


“…The authors were primarily affiliated with institutions in the United States (n=47 of 122 different countries identified per publication, 38.5%), followed by Germany (n=11/122, 9%), Turkey (n=7/122, 5.7%), the United Kingdom (n=6/122, 4.9%), China/Australia/Italy (n=5/122, 4.1%, respectively), and 24 (n=36/122, 29.5%) other countries. Most studies examined one or more applications based on the GPT-3.5 architecture (n=66 of 124 different LLMs examined per study, 53.2%) [13,26–29,31–34,36–40,42–49,52–54,56–61,63,65–67,71,72,74,75,77,78,81–89,91,92,94,95,97–100,102–104,106–109,111], followed by GPT-4 (n=33/124, 26.6%) [13,25,27,29,30,34–36,41,43,50,51,54,55,58,61,64,68–70,74,76,79–81,83,87,89,90,93,96,98,99,101,105], Bard (n=10/124, 8.1%; now known as Gemini) [33,48,49,55,73,74,80,87,94,99], Bing Chat (n=7/124, 5.7%; now Microsoft Copilot) [49,51,55,73,94,99,110], and other applications based on Bidirectional Encoder Representations from Transformers (BERT; n=4/124, 3...…”

Section: Results (mentioning)

confidence: 99%

“…Most studies examined one or more applications based on the GPT-3.5 architecture (n=66 of 124 different LLMs examined per study, 53.2%) [13,26–29,31–34,36–40,42–49,52–54,56–61,63,65–67,71,72,74,75,77,78,81–89,91,92,94,95,97–100,102–104,106–109,111], followed by GPT-4 (n=33/124, 26.6%) [13,25,27,29,30,34–36,41,43,50,51,54,55,58,61,64,68–70,74,76,79–81,83,87,89,90,93,96,98,99,101,105], Bard (n=10/124, 8.1%; now known as Gemini) [33,48,49,55,73,74,80,87,94,99], Bing Chat (n=7/124, 5.7%; now Microsoft Copilot) [49,51,55,73,94,99,110], and other applications based on Bidirectional Encoder Representations from Transformers (BERT; n=4/124, 3.2%) [13,83,84], Large Language Model Meta-AI (LLaMA; n=3/124, 2.4%) [55], or Claude by Anthropic (n=1/124, 0.8%) [55]. The majority of applications were p...…”

Section: Results (mentioning)

confidence: 99%

“…A total of 18 (n=18/89, 20.2%) studies reported the presence of conflicts of interest and funding [13,24,38,40,54,58,59,67,69–71,74,80,84,96,103,105,111]. Most studies did not report information about the institutional review board (IRB) approval (n=55/89, 61.8%) or deemed IRB approval unnecessary (n=28/89, 31.5%). Six studies obtained IRB approval (n=6/89, 6.7%).…”

Section: Results (mentioning)

confidence: 99%


Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

Busch, Hoffmann, Rueger et al. 2024. Preprint


The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care using a data-driven convergent synthesis approach. We searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4,349 initial records, 89 studies across 29 medical specialties were included, primarily examining models based on the GPT-3.5 (53.2%, n=66 of 124 different LLMs examined per study) and GPT-4 (26.6%, n=33/124) architectures in medical question answering, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations included 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations included 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. In conclusion, this study is the first review to systematically map LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.


“…AI‐LLMs have been utilised as a decision support tool for selecting imaging examinations and generating radiology referrals in the emergency department setting [2]. Additionally, AI‐LLMs have been demonstrated to produce highly accurate, digestible information for patients regarding various orthopaedic pathologies and procedures including anterior cruciate ligament tears and shoulder stabilisation procedures [9, 15].…”

Section: Introduction (mentioning)

confidence: 99%

From technical to understandable: Artificial Intelligence Large Language Models improve the readability of knee radiology reports

Butler, Puleo, Harrington et al. 2024. Knee Surg. Sports Traumatol. Arthrosc.


Purpose: The purpose of this study was to evaluate the effectiveness of an Artificial Intelligence‐Large Language Model (AI‐LLM) at improving the readability of knee radiology reports. Methods: Reports of 100 knee X‐rays, 100 knee computed tomography (CT) scans and 100 knee magnetic resonance imaging (MRI) scans were retrieved. The following prompt command was inserted into the AI‐LLM: ‘Explain this radiology report to a patient in layman's terms in the second person: [Report Text]’. The Flesch–Kincaid reading level (FKRL) score, Flesch reading ease (FRE) score and report length were calculated for the original radiology report and the AI‐LLM generated report. Any ‘hallucination’ or inaccurate text produced by the AI‐LLM‐generated report was documented. Results: Statistically significant improvements in mean FKRL scores in the AI‐LLM generated X‐ray report (12.7 ± 1.0 to 7.2 ± 0.6), CT report (13.4 ± 1.0 to 7.5 ± 0.5) and MRI report (13.5 ± 0.9 to 7.5 ± 0.6) were observed. Statistically significant improvements in mean FRE scores in the AI‐LLM generated X‐ray report (39.5 ± 7.5 to 76.8 ± 5.1), CT report (27.3 ± 5.9 to 73.1 ± 5.6) and MRI report (26.8 ± 6.4 to 73.4 ± 5.0) were observed. Superior FKRL scores and FRE scores were observed in the AI‐LLM‐generated X‐ray report compared to the AI‐LLM‐generated CT report and MRI report, p < 0.001. The hallucination rates in the AI‐LLM generated X‐ray report, CT report and MRI report were 2%, 5% and 5%, respectively. Conclusions: This study highlights the promising use of AI‐LLMs as an innovative, patient‐centred strategy to improve the readability of knee radiology reports. The clinical relevance of this study is that an AI‐LLM‐generated knee radiology report may enhance patients' understanding of their imaging reports, potentially reducing the responder burden placed on the ordering physicians. However, due to the ‘hallucinations’ produced by the AI‐LLM‐generated report, the ordering physician must always engage in a collaborative discussion with the patient regarding both reports and the corresponding images. Level of Evidence: Level IV.
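The two readability measures named in this abstract follow the standard published formulas, which can be sketched in a few lines of Python. This is only an illustration, not the tool the study authors used: the function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a rough vowel-group heuristic, so scores will differ somewhat from those of dedicated readability calculators.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch reading ease, Flesch-Kincaid grade level) for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Standard Flesch reading ease formula: higher means easier to read.
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Standard Flesch-Kincaid grade level formula: approximate US school grade.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl


if __name__ == "__main__":
    fre, fkgl = readability("The cat sat on the mat.")
    print(f"FRE={fre:.1f}, grade={fkgl:.1f}")
```

A higher reading ease score and a lower grade level both indicate simpler text, which is the direction of change the abstract reports for the AI-LLM-generated reports.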

