Aim
To assess the generalisability of a deep learning (DL) system for screening mammography, developed at New York University (NYU), USA (1,2), on a South Australian (SA) dataset.
Methods and Materials
Clients with pathology-proven lesions (n=3,160) and age-matched controls (n=3,240) were selected from women screened at BreastScreen SA between January 2010 and December 2016 (n=207,691 clients) and split into training, validation and test subsets (70\%, 15\% and 15\%, respectively). The primary outcome was the area under the curve (AUC), in the SA Test Set 1 (SATS1), for differentiating invasive breast cancer or ductal carcinoma in situ (n=469) from age-matched controls (n=490) and benign lesions (n=44). The NYU models without (NYU1) and with (NYU2) heatmaps were each evaluated statically (no local training), after training locally without transfer learning (TL), and after retraining with TL.
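As a rough illustration of the primary outcome measure, the AUC and a 95\% confidence interval can be computed as sketched below. This is a minimal sketch, not the study's actual evaluation code: the abstract does not state how its confidence intervals were derived, so a percentile bootstrap is assumed here, and all function names are illustrative.

```python
import numpy as np

def auc(y_true, y_score):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive case scores higher than a randomly chosen
    negative case (ties receive half credit)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) CI for the AUC -- one common
    choice; an assumption, since the abstract does not specify."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample cases with replacement
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():  # need both classes to compute an AUC
            continue
        stats.append(auc(yt, ys))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

For example, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` counts three of four positive-negative pairs correctly ordered, giving 0.75.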
Results
The static NYU1 model achieved AUCs of 83.0\% (95\% CI 82.4\%-83.6\%) (2) in the NYU test set (NYTS) and 75.8\% (95\% CI 72.6\%-78.8\%) in the SATS1. The static NYU2 model achieved AUCs of 88.6\% (95\% CI 88.3\%-88.9\%) (2) in the NYTS and 84.5\% (95\% CI 81.9\%-86.8\%) in the SATS1. Trained without TL, NYU1 and NYU2 achieved AUCs in the SATS1 of 65.8\% (95\% CI 62.2\%-69.1\%) and 85.9\% (95\% CI 83.5\%-88.2\%), respectively. Retrained with TL, NYU1 and NYU2 achieved AUCs of 82.4\% (95\% CI 79.7\%-84.9\%) and 86.3\% (95\% CI 84.0\%-88.5\%), respectively.
Conclusion
We did not fully reproduce the reported performance of the NYU system on a local dataset, although local retraining with TL approximated that level of performance. Optimising models for the local clinical environment may therefore improve performance, and generalisation of DL systems to new environments may be challenging.