2023
DOI: 10.1016/j.xops.2023.100324

Evaluating the Performance of ChatGPT in Ophthalmology


Cited by 189 publications (40 citation statements). References 15 publications.
“…Umer and Habib (2022) argue that an acceptable accuracy threshold for diagnostic tasks should be set at >90%. Nevertheless, our result is similar to those of other studies (Antaki et al., 2023; Gilson et al., 2023; Huh, 2023; Kung et al., 2023; Thirunavukarasu et al., 2023), which have recognized the promising potential of this LLM for medical education and clinical decision‐making. However, some studies have warned of variability in response accuracy, with incomplete or incorrect answers being common (Lahat et al., 2023; Samaan et al., 2023).…”
Section: Discussion (supporting)
confidence: 92%
“…In recent months, there has been growing interest in the application of LLMs in medicine, particularly in exploring their clinical utility, as evidenced by the emergence of ChatGPT (Antaki et al., 2023; Ge & Lai, 2023; Lahat et al., 2023). Despite the promising results demonstrated by these models, it is crucial to perform a comprehensive evaluation of their performance and potential errors before determining their viability in a clinical setting (Antaki et al., 2023). In this context, a study was conducted to evaluate the consistency and accuracy of answers provided by ChatGPT to questions related to clinical situations in endodontics.…”
Section: Discussion (mentioning)
confidence: 99%
“…For example, it did not perform as well as medical students in Korea on parasitology, 9 and it achieved only 55.8% and 42.7% accuracy on high-stakes Ophthalmic Knowledge Assessment Program exams. 10 ChatGPT did not reach the passing threshold for any of the life support exams, but its answers were generally relevant and accurate and showed better congruence with resuscitation guidelines than those of similar AI systems in previous studies. 11…”
Section: Introduction (mentioning)
confidence: 57%
“…ChatGPT has demonstrated varied abilities in the medical field, performing better in general medicine. 10 Therefore, in this article, we tested ChatGPT with Taiwan’s Family Medicine Board Exam. The process of medical education in Taiwan involves 6 years of study in medical school, followed by the national physician licensing exam to obtain a medical license.…”
Section: Introduction (mentioning)
confidence: 99%