“…Although the evaluation metrics differed considerably across image analysis tasks and studies, making direct comparisons challenging (Table II), performance clearly improved across studies when Transf/Attention mechanisms were used. Some studies demonstrated large (≥5%) differences against the best baseline models [21,35,46,79,101,108,117,121,122,126,127,135]; others showed moderate (<5%) but consistent improvements across the different metrics evaluated [13,18,39,53,54,56,57,62,70,78,91,94,105] and/or the data used [98,100,103,105,108]. In the following paragraphs, we detail the studies that met our two objective generalisation criteria (see Methods): whether a model was a) trained on large data (>2,000 images, Table I) and/or b) analysed data from heterogeneous modalities, multiple modalities, multiple organ areas, and/or multiple datasets of the same modality and organ.…”