2021
DOI: 10.3174/ajnr.a7179

Diagnostic Accuracy and Failure Mode Analysis of a Deep Learning Algorithm for the Detection of Cervical Spine Fractures

Abstract: BACKGROUND AND PURPOSE: Artificial intelligence decision support systems are a rapidly growing class of tools to help manage ever-increasing imaging volumes. The aim of this study was to evaluate the performance of an artificial intelligence decision support system, Aidoc, for the detection of cervical spinal fractures on noncontrast cervical spine CT scans and to conduct a failure mode analysis to identify areas of poor performance.
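The diagnostic-accuracy evaluation described in the abstract reduces to metrics derived from a 2×2 confusion matrix (detector output vs. radiologist ground truth). The sketch below shows how those metrics relate; the counts are purely hypothetical and are not the study's reported results.

```python
# Illustrative diagnostic-accuracy metrics from a 2x2 confusion matrix.
# The counts below are hypothetical, NOT the study's reported data.
tp, fp, fn, tn = 45, 10, 5, 440

sensitivity = tp / (tp + fn)   # fraction of true fractures the system flags
specificity = tn / (tn + fp)   # fraction of fracture-free scans correctly cleared
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"PPV={ppv:.2f}, NPV={npv:.2f}, accuracy={accuracy:.2f}")
```

Note that with a low fracture prevalence, accuracy is dominated by the many true negatives, which is why sensitivity and PPV are usually reported separately for detection tools like this one.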

Cited by 51 publications (19 citation statements) · References 23 publications
“… 21 , 22 In addition to developing medical training programs, improvement of assessment of scans might be achieved by investing in artificial intelligence (AI) that, when proven to have a high sensitivity, can further support assessment of cervical spine CT by emergency physicians. 23 When emergency physicians reach sufficient diagnostic accuracy (with or without AI), it would yield opportunities and flexibility to advance clinical decision‐making before the final radiologist report becomes available.…”
Section: Discussion
confidence: 99%
“…For AI-based medical devices, conducting sanity tests can prevent needless harm to the patient and save considerable resources. However, without sufficiently large, well-annotated datasets, performing analytical validation to determine the root causes that drive AI systems to fail before deployment remains a challenge (5, 35). Moreover, after independent testing data are gathered, regulatory organizations advise that the data be used a limited number of times to prevent over-fitting (36).…”
Section: Methods
confidence: 99%
“…Artificially intelligent (AI) computer-aided diagnostic (CAD) systems have the potential to help radiologists with a multitude of tasks, ranging from tumor classification to improved image reconstruction (1–4). To deploy medical AI systems, it is essential to validate their performance correctly and to understand their weaknesses before they are used on patients (5–8). For AI-based software as a medical device, the gold standard for analytical validation is to assess performance on previously unseen independent datasets (9–12), followed by a clinical validation study.…”
Section: Introduction
confidence: 99%
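The "previously unseen independent dataset" standard described above can be sketched in a few lines: a frozen model is scored once on held-out data it never saw during training. Everything here is a hypothetical stand-in (the toy `model`, the simulated labels), not the study's detector or data.

```python
# Minimal sketch of analytical validation on an unseen dataset:
# score a frozen model once on an independent, held-out test set.
# `model`, the features, and the labels are hypothetical stand-ins.
import random

random.seed(0)

def model(scan_feature):
    # Stand-in for a trained fracture detector (NOT the study's model):
    # flags "fracture" when a single hypothetical feature exceeds 0.5.
    return scan_feature > 0.5

# Simulated independent test set: labels follow the feature with ~10% noise,
# mimicking imperfect agreement between model signal and ground truth.
test_set = []
for _ in range(1000):
    x = random.random()
    label = (x > 0.5) != (random.random() < 0.1)  # 10% label noise
    test_set.append((x, label))

correct = sum(model(x) == label for x, label in test_set)
print(f"held-out accuracy: {correct / len(test_set):.2f}")
```

The key design point is that the test set is touched only once: repeated evaluation on the same held-out data gradually turns it into training data, which is exactly the over-fitting risk the regulatory guidance cited above warns against.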
“…All seventeen studies used a CNN to detect and/or classify fractures on CT scans [12–28]. Eight studies addressed detection of rib fractures [13, 17, 19, 20, 22, 25–27], three studies the performance for detection [12, 21] and classification [18] of pelvic fractures, four for detection of spine fractures [14, 16, 23, 28], one for detection and classification of femur fractures [24], and one of calcaneal fractures [15]. Fourteen studies used two output classes (fracture yes/no).…”
Section: Description of Studies
confidence: 99%
“…Eight studies used the F1-score to assess performance instead: in two, the F1-score was assessed for the classification of healing status [25, 26]; in one, for displacement [21]; and in five [13, 18–20, 22], for the detection of fractures. Additionally, we calculated the F1-scores in three studies [12, 23, 28] to facilitate comparison. F1-scores ranged from 0.35 in Yacoub et al [23] to 0.94 in Meng et al [20].…”
Section: Primary Outcome: The Performance of CNN
confidence: 99%
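The F1-score used by the studies above is the harmonic mean of precision and recall, computed from true-positive, false-positive, and false-negative counts. The sketch below uses hypothetical counts chosen to land near the two ends of the 0.35–0.94 range reported, to show how imbalanced precision/recall drags F1 down.

```python
# F1-score: harmonic mean of precision and recall.
# The TP/FP/FN counts below are hypothetical illustrations,
# not data from any of the cited studies.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)  # flagged cases that are real fractures
    recall = tp / (tp + fn)     # real fractures that were flagged
    return 2 * precision * recall / (precision + recall)

# Balanced, high precision and recall -> high F1 (here 0.94).
print(round(f1_score(tp=94, fp=6, fn=6), 2))

# Many false positives and misses -> low F1 (here 0.35).
print(round(f1_score(tp=35, fp=80, fn=50), 2))
```

Unlike accuracy, F1 ignores true negatives entirely, which makes it a common choice for fracture detection where negative scans vastly outnumber positives.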