ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board–style questions

Hopkins, Ben; Nguyen, Vincent; Dallas, Jonathan; Texakalidis, Pavlos; Yang, Max; Renn, Alex; Guerra, Gage; Kashif, Zain; Cheok, Stephanie; Zada, Gabriel; Mack, William J.

doi:10.3171/2023.2.jns23419

Cited by 28 publications

(14 citation statements)

References 1 publication

“…When assessed on more than 2,000 questions using all parts of the CNS SANS question bank, ChatGPT achieved a fairly unimpressive overall accuracy of 50.4% (1,068/2,120). Our findings corroborate those of Hopkins et al, who found a similar accuracy of 54.9% (262/477) using non-imaging questions from another question bank [19]. In contrast, Ali et al reported a much higher accuracy of 73.4% (367/500) using both imaging and non-imaging questions from part one of the CNS SANS question bank [20].…”

Section: Discussion (supporting)

confidence: 92%

Educational Limitations of ChatGPT in Neurosurgery Board Preparation

Powers, McCandless, Taussky et al. 2024. Cureus

Objective This study evaluated the potential of Chat Generative Pre-trained Transformer (ChatGPT) as an educational tool for neurosurgery residents preparing for the American Board of Neurological Surgery (ABNS) primary examination. Methods Non-imaging questions from the Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) online question bank were input into ChatGPT. Accuracy was evaluated and compared to human performance across subcategories. To quantify ChatGPT’s educational potential, the concordance and insight of explanations were assessed by multiple neurosurgical faculty. Associations among these metrics as well as question length were evaluated. Results ChatGPT had an accuracy of 50.4% (1,068/2,120), with the highest and lowest accuracies in the pharmacology (81.2%, 13/16) and vascular (32.9%, 91/277) subcategories, respectively. ChatGPT performed worse than humans overall, as well as in the functional, other, peripheral, radiology, spine, trauma, tumor, and vascular subcategories. There were no subjects in which ChatGPT performed better than humans and its accuracy was below that required to pass the exam. The mean concordance was 93.4% (198/212) and the mean insight score was 2.7. Accuracy was negatively associated with question length (R²=0.29, p=0.03) but positively associated with both concordance (p<0.001, q<0.001) and insight (p<0.001, q<0.001). Conclusions The current study provides the largest and most comprehensive assessment of the accuracy and explanatory quality of ChatGPT in answering ABNS primary exam questions. The findings demonstrate shortcomings regarding ChatGPT’s ability to pass, let alone teach, the neurosurgical boards.
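As a quick sanity check on the arithmetic in the abstract above, the reported percentages can be recomputed from the stated counts. The following is a minimal Python sketch, not part of the study: the names `REPORTED`, `accuracy_percent`, and `check_reported` are illustrative, and the 0.1-point tolerance is an assumption made because the abstract does not state a rounding convention.

```python
# Counts and stated percentages copied from the abstract above.
# The dictionary keys are illustrative labels, not the study's own identifiers.
REPORTED = {
    "overall": (1068, 2120, 50.4),
    "pharmacology": (13, 16, 81.2),
    "vascular": (91, 277, 32.9),
    "concordance": (198, 212, 93.4),
}


def accuracy_percent(correct: int, total: int) -> float:
    """Return correct/total expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def check_reported(reported: dict, tolerance: float = 0.1) -> dict:
    """Map each label to (computed percentage, whether it matches the stated value).

    The tolerance is an assumed allowance for one-decimal rounding.
    """
    results = {}
    for label, (correct, total, stated) in reported.items():
        computed = accuracy_percent(correct, total)
        results[label] = (computed, abs(computed - stated) <= tolerance)
    return results


if __name__ == "__main__":
    for label, (computed, ok) in check_reported(REPORTED).items():
        status = "consistent" if ok else "MISMATCH"
        print(f"{label}: computed {computed:.2f}% ({status})")
```

Under this tolerance, each stated percentage is consistent with its counts; a tighter tolerance would make the check sensitive to the rounding rule used.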

“…Overall, the phenomenal improvement in the test-taking performance of ChatGPT 4 compared to ChatGPT 3.5 raises intriguing questions regarding future applications and implications of AI in medical education and diagnostics. AI has shown its prowess not only on the USMLE examinations in medical education but also on advanced examinations, such as the neurosurgical written boards [16]. This phenomenon ventures into other aspects of medicine as well, including research and clinical performance [17].…”

Section: Principal Findings (mentioning)

confidence: 99%

Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis

Knoedler, Alfertshofer, Knoedler et al. 2024. JMIR Med Educ

Background The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student’s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT’s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. Objective This paper aimed to analyze ChatGPT’s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. Methods A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions were entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. Results Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=–0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=–0.289 for ChatGPT 3.5 and ρ=–0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. Conclusions In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics.

“…Despite not being trained on a specific data set, ChatGPT performed at the level of a first‐year resident in plastic surgery on the in‐service training exam [7,8]. In neurosurgery, ChatGPT performed worse than the average user on Self‐Assessment Neurosurgery questions but better than residents in some topics [9]. Clearly, there is already some rudimentary capacity in providing specialty care.…”

Section: Discussion (mentioning)

confidence: 99%

Diagnostic and Management Applications of ChatGPT in Structured Otolaryngology Clinical Scenarios

Qu, Qureshi, Petersen et al. 2023. OTO Open

Objective To evaluate the clinical applications and limitations of chat generative pretrained transformer (ChatGPT) in otolaryngology. Study Design Cross‐sectional survey. Setting Tertiary academic center. Methods ChatGPT 4.0 was queried for diagnoses and management plans for 20 physician‐written clinical vignettes in otolaryngology. Attending physicians were then asked to rate the difficulty of the clinical vignettes and agreement with the differential diagnoses and management plans of ChatGPT responses on a 5‐point Likert scale. Summary statistics were calculated. Univariate ordinal regression was then performed between vignette difficulty and quality of the diagnoses and management plans. Results Eleven attending physicians completed the survey (61% response rate). Overall, vignettes were rated as very easy to neutral difficulty (range of median score: 1.00‐4.00; overall median 2.00). There was a high agreement with the differential diagnosis provided by ChatGPT (range of median score: 3.00‐5.00; overall median: 5.00). There was also high agreement with treatment plans (range of median score: 3.00‐5.00; overall median: 5.00). There was no association between vignette difficulty and agreement with differential diagnosis or treatment. Lower diagnosis scores had greater odds of having lower treatment scores. Conclusion Generative artificial intelligence models like ChatGPT are being rapidly adopted in medicine. Performance with curated, easy‐to‐moderate difficulty otolaryngology scenarios indicates high agreement with physicians for diagnosis and management. However, a decreased quality in diagnosis is associated with decreased quality in management. Further research is necessary on ChatGPT's ability to handle unstructured clinical information.
