2016
DOI: 10.2196/jmir.5870
Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View

Abstract: Background: As more and more researchers are turning to big data for new opportunities of biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Because of the flexibility in specifying machine learning models, the results are often insufficiently reported in research articles, hindering reliable assessment of model validity and consiste…

Cited by 666 publications
(609 citation statements)
References 51 publications
“…A common mistake in any method where the dataset is split into training and test sets is to allow data leakage to occur [37]. This refers to using any data or information during model generation that is not part of the training set and can result in overfitting and overly optimistic model performance.…”
Section: Techniques for Internal Validation
confidence: 99%
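The pitfall described in this citation statement can be sketched in code. The following is a hypothetical scikit-learn example, not taken from the cited paper: fitting a preprocessing step (here, feature scaling) on the full dataset before the train/test split lets test-set statistics leak into model generation.

```python
# Sketch of the data-leakage pitfall: preprocessing fit on all data vs.
# preprocessing fit on the training set only (assumed scikit-learn API).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Leaky: the scaler sees the test rows before the split.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Correct: split first, then fit the scaler on the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)  # training-set statistics only
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
clean_score = model.score(scaler.transform(X_te), y_te)
```

With a simple scaler on i.i.d. data the two scores may differ little; the point is the pattern — any information derived from test rows during model generation constitutes leakage.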
“…Machine learning and survival analyses were performed in R (39) using the caret package and the coxph R function. The development of the predictive models and the reporting of the findings were in accordance with an EQUATOR guideline for reporting machine learning predictive models (40) and the STARD 2015 checklist (41).…”
Section: Discussion
confidence: 99%
“…This is especially important because the pre-processing steps allowed the same cell to be found multiple times but with different training labels (Figure 3a). We found that if we did not split this way, we would have indirect data leakage (Luo et al., 2016).…”
Section: Considerations for Weak Supervision
confidence: 98%
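The group-aware split this excerpt alludes to can be illustrated as follows. This is a hypothetical scikit-learn sketch, not the authors' code: when the same underlying cell appears in several rows, a plain random split can place copies of one cell on both sides, so rows are instead split by a group identifier.

```python
# Sketch of avoiding indirect leakage with a group-aware split
# (assumed scikit-learn GroupShuffleSplit API).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
cell_ids = np.repeat(np.arange(30), 4)  # each cell appears in 4 rows
X = rng.normal(size=(cell_ids.size, 5))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=cell_ids))

# No cell id appears on both sides of the split.
overlap = set(cell_ids[train_idx]) & set(cell_ids[test_idx])
```

Splitting on the group id rather than on individual rows is what prevents the "same cell, different copies" form of leakage described above.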
“…We also sought to determine what our scores would have looked like if data leakage occurred during the location prediction stage. In machine learning and statistics, data leakage can lead to inflated performance estimates when data from the validation or test set are used during training (Luo et al., 2016). Overfitting because of data leakage would have been easy to do by mistake because the provided binarized expression data, generated by DistMap, were produced using all expression data and consequently should never be used at any step of training or testing.…”
Section: Measuring and Avoiding Data Leakage During Location Prediction
confidence: 99%
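The inflated performance estimates this excerpt warns about can be demonstrated with a small experiment. The sketch below is hypothetical and uses assumed scikit-learn APIs: selecting features on the full dataset, labels included, before cross-validation makes pure noise look predictive, whereas refitting the selection inside each training fold does not.

```python
# Sketch of inflated estimates from leakage via whole-dataset feature
# selection, compared with selection nested inside cross-validation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))    # pure noise features
y = rng.integers(0, 2, size=100)    # random labels

# Leaky: top features chosen with knowledge of y on *all* rows.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# Correct: selection is refit inside each CV training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
clean = cross_val_score(pipe, X, y, cv=5).mean()
# Typically leaky is well above chance while clean stays near 0.5,
# even though y is random.
```

The pipeline matters because it guarantees that every data-dependent step, not just the final classifier, is fit only on the training portion of each fold.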