2023
DOI: 10.1001/jamanetworkopen.2023.46721
Performance of Large Language Models on a Neurology Board–Style Examination

Marc Cicero Schubert,
Wolfgang Wick,
Varun Venkataramani

Abstract: Importance: Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored. Objective: To assess the performance of LLMs on neurology board–style examinations. Design, Setting, and Participants: This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation uti…

Cited by 25 publications (6 citation statements) · References 28 publications
“…There are multiple ways that can provide users with greater confidence about the accuracy of AI search results. The use of a "custom attribution engine" as announced by Adobe would allow users to verify AI findings through source citation [14]. This type of approach should allow users to interpret narrative results in terms of source information and determine if there is any distortion in the AI narrative.…”
Section: Discussion
confidence: 99%
“…The study found that ChatGPT models performed notably poorer on questions involving concepts of redefinition or invention [27]. Emerging evidence suggests that AI may exhibit its own cognitive process, as indirectly indicated by a trend of improved performance on questions at the lower levels of Bloom's taxonomy, particularly in disciplines such as neurology, radiology, physiology, microbiology, and biochemistry [28][29][30][31][32]. In these investigations, the majority of questions assessed originated from internal materials or were inaccessible to ChatGPT models during the study periods.…”
Section: Comparison With Prior Work
confidence: 94%
“…Previous publications evaluating LLMs across various disciplines have covered fields such as gastroenterology [7], pathology [8], neurology [9], physiology [6,10], and solving case vignettes in physiology [11]. In a cross-sectional study, the performance of LLMs on neurology board–style examinations was assessed using a question bank approved by the American Board of Psychiatry and Neurology.…”
Section: Introduction
confidence: 99%
“…In a cross-sectional study, the performance of LLMs on neurology board–style examinations was assessed using a question bank approved by the American Board of Psychiatry and Neurology. The questions were categorized into lower-order and higher-order based on the Bloom taxonomy for learning and assessment [9]. To the best of our knowledge, there was no study specifically evaluating LLMs in the field of neurophysiology.…”
Section: Introduction
confidence: 99%