2019
DOI: 10.48550/arxiv.1909.06539
Preprint

AI slipping on tiles: data leakage in digital pathology

Abstract: Bioinformatics of high-throughput omics data (e.g. microarrays and proteomics) was plagued by countless reproducibility issues at the start of the century. These concerns have motivated international initiatives such as the FDA-led MAQC Consortium, which addresses the reproducibility of predictive biomarkers by means of appropriate Data Analysis Plans (DAPs). For instance, repeated cross-validation is a standard procedure meant to mitigate the risk that information from held-out validation data may be used d…

Citations: cited by 2 publications (3 citation statements)
References: 18 publications
“…Similarly, exit wavefunctions were simulated with CIFs from different journals that were concatenated or numbered sequentially. There is leakage 70,71 between training, validation and test sets due to overlap between materials published in different journals and between different scientists' work. However, further leakage can be minimized by selecting dataset partitions before any shuffling and, for wavefunctions, by ensuring that wavefunctions simulated for each journal are not split between partitions.…”
Section: Discussion (mentioning)
confidence: 99%
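
The group-wise partitioning described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name `split_by_group`, the 80/10/10 fractions, and the patient/tile toy data are all assumptions made for the example. The grouping key stands in for the journal in the quoted passage; a patient or slide ID would play the same role for pathology tiles.

```python
import random
from collections import defaultdict


def split_by_group(samples, group_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole groups to train/val/test so no group spans two partitions.

    `group_of` is a hypothetical callable that returns the group key of a sample.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle the group keys, never the individual samples, so that
    # partition membership is decided at the group level.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    parts = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    return {name: [s for k in ks for s in groups[k]] for name, ks in parts.items()}


# Toy data (assumed): 10 patients with 5 tiles each.
tiles = [(f"patient_{p}", t) for p in range(10) for t in range(5)]
splits = split_by_group(tiles, group_of=lambda tile: tile[0])

# Collect the set of patients present in each partition.
patients = {name: {tile[0] for tile in part} for name, part in splits.items()}
```

Splitting at the sample level instead would scatter tiles from one patient across partitions, which is the overlap the quoted passage warns about.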
“…In a previous study [1], we reported that data leakage contaminated nearly half of the studies using a CNN on T1w MRI for the diagnosis of Alzheimer's disease. Other studies using deep learning in the health domain also mention that data leakage pollutes their field of application: [2] in breast cancer detection from mammograms, [3] for Covid-19 diagnosis from chest radiography and [4] for image classification in digital pathology. Finally, [5] quantified the difference between a biased and a right split between train and test sets on the test accuracy for several tasks using neuroimaging data.…”
Section: Data Leakage Handling (mentioning)
confidence: 99%
“…This statement also applies to computer-aided diagnosis systems in which convolutional neural networks (CNNs) are widely used to provide a diagnosis or predict the future state of patients from neuroimaging data. Unfortunately, this recent massive use of deep learning has also been associated with methodological flaws in many studies [1,2,3,4,5]. Such studies overestimate the performance of their network in performing classification because their test set (when it exists) is contaminated by data leakage.…”
Section: Introduction (mentioning)
confidence: 99%