2022
DOI: 10.1038/s41598-022-16514-7

Clinically focused multi-cohort benchmarking as a tool for external validation of artificial intelligence algorithm performance in basic chest radiography analysis

Abstract: Artificial intelligence (AI) algorithms evaluating [supine] chest radiographs ([S]CXRs) have increased remarkably in number recently. Since training and validation are often performed on subsets of the same overall dataset, external validation is mandatory to reproduce results and reveal potential training errors. We applied multi-cohort benchmarking to the publicly accessible (S)CXR-analyzing AI algorithm CheXNet, comprising three clinically relevant study cohorts which differ in patient positioning ([S]CXRs…
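The external-validation setup the abstract describes can be illustrated with a short sketch. CheXNet itself is a DenseNet-121 with 14 sigmoid outputs trained on ChestX-ray14; everything else below (checkpoint path, data loader, cohort handling, function names) is a hypothetical scaffold for per-cohort evaluation, not the authors' benchmarking code.

```python
# Minimal sketch of multi-cohort external validation for a CheXNet-style
# classifier. Assumes a DenseNet-121 checkpoint with 14 sigmoid outputs
# (the ChestX-ray14 label set); loaders and paths are hypothetical.
import torch
import torchvision
from sklearn.metrics import roc_auc_score

def load_chexnet(checkpoint_path: str) -> torch.nn.Module:
    """Build DenseNet-121 with a 14-way head, as in CheXNet."""
    model = torchvision.models.densenet121(weights=None)
    model.classifier = torch.nn.Linear(model.classifier.in_features, 14)
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    return model.eval()

@torch.no_grad()
def cohort_auc(model: torch.nn.Module, loader, label_idx: int) -> float:
    """AUC for one finding on one external cohort (e.g. supine vs. erect CXRs)."""
    scores, labels = [], []
    for images, targets in loader:  # loader yields (B, 3, H, W) image batches
        probs = torch.sigmoid(model(images))[:, label_idx]
        scores.extend(probs.tolist())
        labels.extend(targets[:, label_idx].tolist())
    return roc_auc_score(labels, scores)
```

Running `cohort_auc` separately per cohort is what makes the benchmark "multi-cohort": the same frozen model is scored against each positioning-specific test set.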

Cited by 6 publications (11 citation statements); references 39 publications.
“…Among the 63 studies, 56 studies identified pneumothorax on chest radiography [26–81], four studies on computed tomography [82–85], one study on ECG [86], one study used chest radiography and photography using a smartphone [87], and one study used chest radiography and tabular data [88]. Six studies developed and internally tuned DLs [37, 52, 63, 67, 74, 76], 25 studies also internally tested their DLs [32, 33, 35, 38, 40, 41, 43, 45, 47, 48, 50, 55, 60, 65, 69, 70, 73, 75, 79–83, 85, 86] and 32 studies externally tested the DLs [26–31, 34, 36, 39, 42, 44, 46, 49, 51, 53, 54, 56–59, 61, 62, 64, 66, 68, 71, 72, 77, 78, 84, 87, 88].…”
Section: Results (mentioning; confidence: 99%)
“…As for model development, to generate a reference standard for image labelling, 18 studies used expert consensus [27–33, 35–38, 49, 53–55, 71, 77, 83], two relied on the opinion of a single expert reader [76, 85], 16 used pre-existing radiological reports or other imaging modalities [34, 41, 43, 45, 46, 52, 60, 61, 67, 75, 78–82, 87], one study defined their reference standard as surgical confirmation (indicated for surgery) [86], 11 studies used mixed methods (any combination of the aforementioned) [40, 47, 48, 50, 51, 62, 63, 65, 69, 70, 73] and two studies did not report how their reference standard was generated [74, 88]. As for model testing, to generate a reference standard for image labelling, 26 studies used expert consensus [26–28, 30–33, 38, 39, 44, 51, 54–57, 61, 64, 66, 68, 71–73, 77, 80, 83, 84], two relied on the opinion of a single expert reader [58, 85], 11 used pre-exist…”
Section: Results (mentioning; confidence: 99%)
“…The next most comprehensive model, which was capable of detecting 72 findings, demonstrated an average AUC of 0.77 [11]. When compared with physician detection accuracy, the identified devices were typically found to be as accurate, or more accurate, than radiologist or non-radiologist clinicians [11, 43, 59, 63, 71, 74, 81–83, 88]. Taking this further, multiple studies demonstrated that use of well-trained and validated deep learning models can improve the clinical finding classification performance of clinicians when acting as a diagnostic assistance device [42, 43, 57, 62, 66, 74, 83, 87].…”
Section: Discussion (mentioning; confidence: 99%)
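For context, the "average AUC" quoted in that statement is typically the unweighted (macro) mean of per-finding AUCs. A minimal sketch, assuming hypothetical arrays `y_true` and `y_score` of shape (N, 72) holding binary labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: (N, 72) binary ground-truth labels; y_score: (N, 72) probabilities.
# Both are hypothetical stand-ins for a multi-finding test set.
aucs = [roc_auc_score(y_true[:, k], y_score[:, k]) for k in range(y_true.shape[1])]
print(f"macro-average AUC over {len(aucs)} findings: {np.mean(aucs):.2f}")
```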
“…Tam et al also reported the improved detection of suspicious pulmonary nodules on CXR with AI-aided interpretation (sensitivities 89–94%) versus unaided interpretation for all three radiologists (sensitivities 69–86%), with a slight increase in false positives and a decrease in specificity [18]. Another CXR study reported that standalone AI performance for pneumothorax, pleural effusion and lung lesions was similar to that for radiology residents, but was significantly better than the performance of non-radiology residents [19]. Beyond CXRs, other studies have reported on missed findings of intracranial hemorrhage in noncontrast head CT examinations and mammography [20].…”
Section: Introduction (mentioning; confidence: 99%)
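The trade-off that statement describes (more detections at the cost of more false positives) follows directly from the defining ratios of sensitivity and specificity. A minimal sketch with hypothetical confusion-matrix counts, not figures from the cited study:

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical reader counts: AI aid converts 5 misses into hits but adds
# 4 false positives, raising sensitivity while lowering specificity.
print(sens_spec(tp=69, fn=31, tn=180, fp=20))   # unaided:  (0.69, 0.90)
print(sens_spec(tp=74, fn=26, tn=176, fp=24))   # AI-aided: (0.74, 0.88)
```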