AI slipping on tiles: data leakage in digital pathology

Bussola, Nicole; Marcolini, Alessia; Maggio, Valerio; Jurman, Giuseppe; Furlanello, Cesare

doi:10.48550/arxiv.1909.06539

Cited by 2 publications

(3 citation statements)

References 18 publications

“…Similarly, exit wavefunctions were simulated with CIFs from different journals that were concatenated or numbered sequentially. There is leakage [70,71] between training, validation and test sets due to overlap between materials published in different journals and between different scientists' work. However, further leakage can be minimized by selecting dataset partitions before any shuffling and, for wavefunctions, by ensuring that wavefunctions simulated for each journal are not split between partitions.…”

Section: Discussion (mentioning)

confidence: 99%

Warwick Electron Microscopy Datasets

Ede

2020

Preprint

Large, carefully partitioned datasets are essential to train neural networks and standardize performance benchmarks. As a result, we have set up a new dataserver to make University of Warwick electron microscopy datasets available to the wider community. There are three main datasets containing 19769 scanning transmission electron micrographs, 17266 transmission electron micrographs, and 98340 simulated exit wavefunctions, with multiple variants of each dataset for different applications. Each dataset is visualized by t-distributed stochastic neighbour embedding, and we have created interactive visualization tools.
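The partitioning advice in the citation statement above (keep everything that shares a source in a single partition) can be illustrated with a minimal sketch. It is not code from any of the cited works: the function names `group_split` and `leaked_groups`, the `(patient_id, tile_index)` tile layout, and the 80/20 ratio are all assumptions chosen for illustration. In digital pathology the group would typically be the patient or slide that a tile was cut from; in the electron microscopy case it would be the journal.

```python
import random
from collections import defaultdict


def group_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items so that every group lands wholly in train or wholly in test.

    Partitions are chosen at the group level first; individual items are
    never shuffled across group boundaries.
    """
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Shuffle group identifiers (not items) with a fixed seed for repeatability.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [x for g in group_ids if g not in test_ids for x in groups[g]]
    test = [x for g in group_ids if g in test_ids for x in groups[g]]
    return train, test


def leaked_groups(train, test, group_of):
    """Return the set of groups that appear in both partitions."""
    return {group_of(x) for x in train} & {group_of(x) for x in test}


# Hypothetical data: 10 patients with 5 tiles each, as (patient_id, tile_index).
tiles = [(f"patient{p}", t) for p in range(10) for t in range(5)]
train, test = group_split(tiles, group_of=lambda tile: tile[0])
leaks = leaked_groups(train, test, group_of=lambda tile: tile[0])
```

Here `leaks` is empty by construction. Running the same `leaked_groups` check on a split made by shuffling individual tiles is a quick way to see whether a patient's tiles ended up on both sides.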

“…In a previous study [1], we reported that data leakage contaminated nearly half of the studies using a CNN on T1w MRI for the diagnosis of Alzheimer's disease. Other studies using deep learning in the health domain also mention that data leakage pollutes their field of application: [2] in breast cancer detection from mammograms, [3] for Covid-19 diagnosis from chest radiography and [4] for image classification in digital pathology. Finally, [5] quantified the difference between a biased and a right split between train and test sets on the test accuracy for several tasks using neuroimaging data.…”

Section: Data Leakage Handling (mentioning)

confidence: 99%

“…This statement also applies to computer-aided diagnosis systems in which convolutional neural networks (CNNs) are widely used to provide a diagnosis or predict the future state of patients from neuroimaging data. Unfortunately, this recent massive use of deep learning has also been associated with methodological flaws in many studies [1,2,3,4,5]. Such studies overestimate the performance of their network in performing classification because their test set (when it exists) is contaminated by data leakage.…”

Section: Introduction (mentioning)

confidence: 99%