2021
DOI: 10.1093/gigascience/giab055

Preventing dataset shift from breaking machine-learning biomarkers

Abstract: Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g., because of recruitment biases. When a dataset sh…

Cited by 56 publications (35 citation statements)
References 53 publications (58 reference statements)

“…As the researcher may be unaware of the corresponding dataset bias, it can lead to important shortcomings of the study. Dataset bias occurs when the data used to build the decision model (the training data) has a different distribution than the data on which it should be applied [17] (the test data). To assess clinically relevant predictions, the test data must match the actual target population, rather than be a random subset of the same data pool as the training data, which is the common practice in machine-learning studies.…”
Section: Data An Imperfect Window On the Clinic (mentioning)
confidence: 99%
“…A major concern in neuroimaging research is the effect of site on the generalizability of ML models (Dockes et al, 2021; Solanes et al, 2021). Sites may differ in terms of scanner infrastructure, acquisition protocols and neuroimaging feature extraction pipelines as well as sample composition.…”
Section: Discussion (mentioning)
confidence: 99%
“…This usually requires collecting a big set of patient data (~millions of samples), both clinical histories and biochemical measurements for the biomarker panel (Swan et al 2015), which, at present, is prohibitive to generating such models (Krassowski et al 2020). However, recent developments in machine learning, for example, utilising a Bayesian interface (Polson and Sokolov 2017), mean that it is possible to train the models with datasets that are an order of magnitude smaller (Assawamakin et al 2013; Zhang and Ling 2018; Dockès et al 2021; Ko et al 2021).…”
Section: Checkpoint Inhibitor Genes As Biomarkers For Cancer Clinical... (mentioning)
confidence: 99%