2016
DOI: 10.2196/jmir.5870
Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View

Abstract: Background: As more and more researchers are turning to big data for new opportunities of biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Because of the flexibility in specifying machine learning models, the results are often insufficiently reported in research articles, hindering reliable assessment of model validity and consiste…

Cited by 666 publications
(609 citation statements)
References 51 publications
“…A common mistake in any method where the dataset is split into training and test sets is to allow data leakage to occur [37]. This refers to using any data or information during model generation that is not part of the training set and can result in overfitting and overly optimistic model performance.…”
Section: Techniques for Internal Validation
confidence: 99%
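The pitfall described in this citation statement can be sketched in code. The following is a hypothetical scikit-learn example, not taken from the cited paper: fitting a preprocessing step (here, feature scaling) on the full dataset before the train/test split lets test-set statistics leak into model generation.

```python
# Sketch of the data-leakage pitfall: preprocessing fit on all data vs.
# preprocessing fit on the training set only (assumed scikit-learn API).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Leaky: the scaler sees the test rows before the split.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Correct: split first, then fit the scaler on the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)  # training-set statistics only
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
clean_score = model.score(scaler.transform(X_te), y_te)
```

With a simple scaler on i.i.d. data the two scores may differ little; the point is the pattern — any information derived from test rows during model generation constitutes leakage.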
“…Machine learning and survival analyses were performed in R (39) using the caret package and the coxph R function. The development of the predictive models and the reporting of the findings were in accordance with an EQUATOR guideline for reporting machine learning predictive models (40) and the STARD 2015 checklist (41).…”
Section: Discussion
confidence: 99%
“…This is especially important because the pre-processing steps allowed the same cell to be found multiple times but with different training labels (Figure 3a). We found that if we did not split this way, we would have indirect data leakage (Luo et al., 2016).…”
Section: Considerations for Weak Supervision
confidence: 98%
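The group-aware split this excerpt alludes to can be illustrated as follows. This is a hypothetical scikit-learn sketch, not the authors' code: when the same underlying cell appears in several rows, a plain random split can place copies of one cell on both sides, so rows are instead split by a group identifier.

```python
# Sketch of avoiding indirect leakage with a group-aware split
# (assumed scikit-learn GroupShuffleSplit API).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
cell_ids = np.repeat(np.arange(30), 4)  # each cell appears in 4 rows
X = rng.normal(size=(cell_ids.size, 5))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=cell_ids))

# No cell id appears on both sides of the split.
overlap = set(cell_ids[train_idx]) & set(cell_ids[test_idx])
```

Splitting on the group id rather than on individual rows is what prevents the "same cell, different copies" form of leakage described above.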
“…We also sought to determine what our scores would have looked like if data leakage occurred during the location prediction stage. In machine learning and statistics, data leakage can lead to inflated performance estimates when data from the validation or test set are used during training (Luo et al., 2016). Overfitting because of data leakage would have been easy to do by mistake because the provided binarized expression data, generated by DistMap, were produced using all expression data and consequently should never be used at any step of training or testing.…”
Section: Measuring and Avoiding Data Leakage During Location Prediction
confidence: 99%
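The inflated performance estimates this excerpt warns about can be demonstrated with a small experiment. The sketch below is hypothetical and uses assumed scikit-learn APIs: selecting features on the full dataset, labels included, before cross-validation makes pure noise look predictive, whereas refitting the selection inside each training fold does not.

```python
# Sketch of inflated estimates from leakage via whole-dataset feature
# selection, compared with selection nested inside cross-validation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))    # pure noise features
y = rng.integers(0, 2, size=100)    # random labels

# Leaky: top features chosen with knowledge of y on *all* rows.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# Correct: selection is refit inside each CV training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
clean = cross_val_score(pipe, X, y, cv=5).mean()
# Typically leaky is well above chance while clean stays near 0.5,
# even though y is random.
```

The pipeline matters because it guarantees that every data-dependent step, not just the final classifier, is fit only on the training portion of each fold.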