2023
DOI: 10.1227/neu.0000000000002551

Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

Abstract: BACKGROUND AND OBJECTIVES: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions is poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparati…

Cited by 113 publications (73 citation statements)
References 5 publications
“…Researchers and developers must consider the impact of their models on society and work toward creating AI systems with appropriate training data sets, feedback mechanisms, and guardrails that accurately reflect our diverse world. Moreover, the active involvement of surgeons and other clinicians in the refinement of these models, such as serving in advisory roles, may help combat inaccuracies in the perception of the discipline of surgery by the general public …”
Section: Discussion, mentioning
confidence: 99%
“…Moreover, the active involvement of surgeons and other clinicians in the refinement of these models, such as serving in advisory roles, may help combat inaccuracies in the perception of the discipline of surgery by the general public. 7,22…”
Section: Limitations, mentioning
confidence: 99%
“…Chat GPT has failed to achieve a passing grade in a gastroenterology board-like examination, 1 achieves close to or near pass levels in cardiology and radiology practice question, 2,3 while outstanding performance has been noted in Neurosurgery board finals. 4 It is unclear to what extent this applies to emergency medicine, a field that requires knowledge from a broad range of interdisciplinary sciences and disciplines.…”
Section: Introduction, mentioning
confidence: 99%
“…As medical professionals, we have an ethical duty to ensure that AI technologies adhere to the highest safety and efficacy standards, regardless of the technical capabilities of such systems. Neurosurgical cases often have substantive clinical equipoise, and existing AI models can struggle with higher-order management questions or even produce “hallucinations.” 2,3 Accordingly, our recent work has elucidated that certain AI systems not infrequently confabulate fabricated or incorrect answer rationales, especially when lacking contextual data. 3 However, even if AI models did not hallucinate or struggle with higher-order reasoning, the onus would still rest on neurosurgeons to assess suitability for clinical application.…”
mentioning
confidence: 99%
“…Neurosurgical cases often have substantive clinical equipoise, and existing AI models can struggle with higher-order management questions or even produce “hallucinations.” 2,3 Accordingly, our recent work has elucidated that certain AI systems not infrequently confabulate fabricated or incorrect answer rationales, especially when lacking contextual data. 3 However, even if AI models did not hallucinate or struggle with higher-order reasoning, the onus would still rest on neurosurgeons to assess suitability for clinical application. Superimposing human judgement remains critical in achieving optimal patient outcomes.…”
mentioning
confidence: 99%