“…Mainly inspired by the clinical fact that radiologists need several adjacent slices for locating and diagnosing lesions on one CT slice, most existing ULD methods take several adjacent 2D CT slices as the inputs to a 2D network architecture [3, 4, 6-10, 12, 15-18] or directly adopt 3D network designs [10] that take 3D volume as input to extract more 3D context information. While both 2D and 3D methods have yielded great ULD performances, the multi-slice-input based 2D detection methods are much more popular than pure 3D fashion because 2D networks benefit from robust 2D models pretrained from large-scale data whereas publicly available 3D medical datasets are not large enough for robust 3D pretraining. While achieving success in ULD, the multi-slice-input based 2D approaches have inherent limitations: (i) Weak global context modeling within each slice.…”