2023
DOI: 10.1590/1806-9282.20230848

Performance of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical Degree Revalidation

Mauro Gobira,
Luis Filipe Nakayama,
Rodrigo Moreira
et al.

Abstract: SUMMARY OBJECTIVE: The aim of this study was to evaluate the performance of ChatGPT-4.0 in answering the 2022 Brazilian National Examination for Medical Degree Revalidation (Revalida) and as a tool to provide feedback on the quality of the examination. METHODS: Two independent physicians entered all examination questions into ChatGPT-4.0. After comparing the outputs with the test solutions, they classified the large language model answers as adeq…

Cited by 27 publications (13 citation statements)
References 9 publications
“…Results were similar among the American Academy of Ophthalmology's Basic and Clinical Science Course (46.0%-84.3%), 41,42,46,55 Ophthoquestions (42.7%-84%), 41,47,48 Fellow of The Royal College of Ophthalmologists (FRCOphth) examination questions (32%-88.4%) 51,57 and Statpearls (55.5%-73.2%). 49 In comparison, a lower score was observed in Brazil board examination questions (41.5%) 44 and higher in European board examinations (91%). 50 The performance of LLMs was found to be better for the subspecialties of medicine, cornea, refractive surgery and oncology, and weakest for glaucoma, neuro-ophthalmology, pathology, tumours, optics, oculoplastic and mathematical concepts.…”
Section: Diagnosis Information (mentioning)
confidence: 81%
“…LLMs regularly outperformed the threshold of ophthalmological specialist examinations. [40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57] However, the accuracy of LLMs was 66.9% (22.4%-91%) while the ophthalmology trainees scored 68.4% (33%-75.7%; Table 3).…”
Section: Performance In Qualifying Ophthalmological Board Examination… (mentioning)
confidence: 99%
“…Examination in Japan [56,57], and the Brazilian National Examination for Medical Degree Revalidation [58].…”
Section: Discussion (mentioning)
confidence: 99%
“…On the other hand, the performance of the AI models in this study was not entirely an unexpected finding. This comes in light of the recent evidence showing AI models’ abilities to pass reputable exams in multiple languages such as the USMLE [37], the German State Examination in Medicine [55], the National Medical Licensing Examination in Japan [56, 57], and the Brazilian National Examination for Medical Degree Revalidation [58].…”
Section: Discussion (mentioning)
confidence: 99%