Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries

Pushpanathan, Krithi; Lim, Zhi Wei; Yew, Samantha Min Er; Chen, David Ziyou; Lin, Hazel Anne Hui'En; Goh, Jocelyn Hui Lin; Wong, Wendy Meihua; Wang, Xiaofei; Tan, Marcus Chun Jin; Koh, Victor Teck Chang; Tham, Yih-Chung

doi:10.1016/j.isci.2023.108163

Cited by 25 publications

(14 citation statements)

References 46 publications


“…The authors were primarily affiliated with institutions in the United States (n=47 of 122 different countries identified per publication, 38.5%), followed by Germany (n=11/122, 9%), Turkey (n=7/122, 5.7%), the United Kingdom (n=6/122, 4.9%), China/Australia/Italy (n=5/122, 4.1%, respectively), and 24 (n=36/122, 29.5%) other countries. Most studies examined one or more applications based on the GPT-3.5 architecture (n=66 of 124 different LLMs examined per study, 53.2%) 13,26–29,31–34,36–40,42–49,52–54,56–61,63,65–67,71,72,74,75,77,78,81–89,91,92,94,95,97–100,102–104,106–109,111 , followed by GPT-4 (n=33/124, 26.6%) 13,25,27,29,30,34–36,41,43,50,51,54,55,58,61,64,68–70,74,76,79–81,83,87,89,90,93,96,98,99,101,105 , Bard (n=10/124, 8.1%; now known as Gemini) 33,48,49,55,73,74,80,87,94,99 , Bing Chat (n=7/124, 5.7%; now Microsoft Copilot) 49,51,55,73,94,99,110 , and other applications based on Bidirectional Encoder Representations from Transformers (BERT; n=4/124, 3...…”

Section: Results (mentioning)

confidence: 99%

“…A total of 18 (n=18/89, 20.2%) studies reported the presence of conflicts of interest and funding. 13,24,38,40,54,58,59,67,69–71,74,80,84,96,103,105,111 Most studies did not report information about the institutional review board (IRB) approval (n=55/89, 61.8%) or deemed IRB approval unnecessary (n=28/89, 31.5%). Six studies obtained IRB approval (n=6/89, 6.7%).…”

Section: Results (mentioning)

confidence: 99%

“…Most reports evaluated LLMs in English (n=88/89, 98.9%) 13,24–103,105–111 , followed by Arabic (n=2/84, 2.3%) 32,104 , Mandarin (n=2/84, 2.3%) 36,75 , and Korean or Spanish (n=1/89, 1.1%, respectively) 75 . The top-five specialties studied were ophthalmology (n=10/89, 11.2%) 37,40,48,51,65,74,97,98,100,101 , gastro-enterology (n=9/89, 10.1%) 25,32,34,36,39,61,62,72,96 , head and neck surgery/otolaryngology (n=8/89, 9%) 35,42,56,64,66,76,78,79 , and radiology 59,70,88–90,110 or plastic surgery 45,47,49,102,107,108 (n=6/89, 6.7%, respectively). A schematic illustration of the identified concepts of LLM applications in patient care is shown in Figure 2.…”

Section: Results (mentioning)

confidence: 99%

“…The evaluation of limitations in output data yielded 7 second-order codes concerning the non-reproducibility (n=38/89, 42.7%) 28,29,34,38,39,41,43,45,46,49,54–61,64,65,71–73,76,80,82,83,85,90,91,94,96,98,99,101,103–105 , non-comprehensiveness (n=78/89, 87.6%) 13,25,26,28–30,32–44,46,48–62,64,65,67–79,81–98,100,102–107,109–111 , incorrectness (n=78/89, 87.6%) 13,25–44,46,49–52,54–62,64–66,69–79,81–85,87–107,109–111 , (un-)safeness (n=39/89, 43.8%) 28,30,35,37,39,40,42–44,46,50,51,57–60,62,64,65,69,70,73,74,76,78–80,82,84,85,91…”

Section: Results (mentioning)

confidence: 99%

“…Some of the incorrect information could be attributed to what is commonly known as hallucination (n=38/89, 42.7%) 25,28,32,33,35–38,40–44,49–51,57–60,65,73,74,76,77,81,83,85,91,94,96–98,100,103,106,107,109 , i.e., the creation of entirely fictitious or false information that has no basis in the input provided or in reality (e.g., “You may be asked to avoid eating or drinking for a few hours before the scan” for a bone scan). However, numerous instances of misinformation were more appropriately classified under alternative concepts of the original psychiatric analogy, as described in detail by Currie et al 43,112,113 These include illusion (n=12/89, 13.5%) 28,36,38,43,57,59,77,78,85,88,94,105 , which is characterized by the generation of deceptive perceptions or the distortion of information by conflating similar but separate concepts (e.g., suggesting that MRI-type sounds might be experienced during standard nuclear medicine imaging), delirium (n=34/89, 38.2%) 13,26,28,30,37,43,50,58,59,61,65,70,72–75,77,79,81–85,90–92,94,95,98,102,103,107,109,110 , which indicates significant gaps in vital information, resulting in a fragmented or confused understanding of a subject (e.g., omission of crucial information about caffeine cessation for stress myocardial perfusion scans), extrapolation (n=11/89, 12.4%) 43,59,…”

Section: Results (mentioning)

confidence: 99%

Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

Busch, Hoffmann, Rueger et al. 2024

Preprint

The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care using a data-driven convergent synthesis approach. We searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4,349 initial records, 89 studies across 29 medical specialties were included, primarily examining models based on the GPT-3.5 (53.2%, n=66 of 124 different LLMs examined per study) and GPT-4 (26.6%, n=33/124) architectures in medical question answering, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations included 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations included 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. In conclusion, this study is the first review to systematically map LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.

Enhancing patient information texts in orthopaedics: How OpenAI's ‘ChatGPT’ can help

Yüce, Yerli, Misir et al. 2024

J. exp. orthop.

Purpose: The internet has become a primary source for patients seeking healthcare information, but the quality of online information, particularly in orthopaedics, often falls short. Orthopaedic surgeons now have the added responsibility of evaluating and guiding patients to credible online resources. This study aimed to assess ChatGPT's ability to identify deficiencies in patient information texts related to total hip arthroplasty websites and to evaluate its potential for enhancing the quality of these texts.

Methods: In August 2023, 25 websites related to total hip arthroplasty were assessed using a standardized search on Google. Peer‐reviewed scientific articles, empty pages, dictionary definitions, and unrelated content were excluded. The remaining 10 websites were evaluated using the hip information scoring system (HISS). ChatGPT was then used to assess these texts, identify deficiencies and provide recommendations.

Results: The mean HISS score of the websites was 9.5, indicating low to moderate quality. However, after implementing ChatGPT's suggested improvements, the score increased to 21.5, signifying excellent quality. ChatGPT's recommendations included using simpler language, adding FAQs, incorporating patient experiences, addressing cost and insurance issues, detailing preoperative and postoperative phases, including references, and emphasizing emotional and psychological support. The study demonstrates that ChatGPT can significantly enhance patient information quality.

Conclusion: ChatGPT's role in elevating patient education regarding total hip arthroplasty is promising. This study sheds light on the potential of ChatGPT as an aid to orthopaedic surgeons in producing high‐quality patient information materials. Although it cannot replace human expertise, it offers a valuable means of enhancing the quality of healthcare information available online.

Level of Evidence: Level IV.

Enhancing Hospital Services: Utilizing Chatbot Technology for Patient Inquiries

Dogan, Faruk Gurcan 2024

Lecture Notes in Networks and Systems
