Leakage Prediction in Machine Learning Models When Using Data from Sports Wearable Sensors

Qi-zheng, Dong

doi:10.1155/2022/5314671

Cited by 19 publications

(13 citation statements)

References 48 publications


“…To do this, we would have needed to use all the available data (train and test), and this would have resulted in data leakage, since the test data would influence the final projection. Therefore, to prevent data leakage and the reporting of overly optimistic and potentially misleading results, the generalization performance of the RPCA and PCoA ordination methods was excluded (29). An overview of the important properties of each ordination method is presented in Table 1.…”

Section: Results (mentioning)

confidence: 99%

Decision Tree Ensembles Utilizing Multivariate Splits Are Effective at Investigating Beta Diversity in Medically Relevant 16S Amplicon Sequencing Data

et al. 2023


“…To do this, we would have needed to use all the available data (train, test) and this would have resulted in data leakage since the test data would influence the final projection. Therefore, to prevent data leakage and the reporting of overly optimistic and potentially misleading results, the generalization performance of RPCA and PCoA ordination methods were excluded (49). An overview of the important properties of each ordination method is presented in Table 1.…”

Section: Results (mentioning)

confidence: 99%

“…When this happens, information about any withheld data is included when the PCoA or RPCA objective function is optimized. This could bias the results and potentially create machine learning models which produce overly optimistic and potentially misleading results (49). With UMAP this is not a problem since UMAP can learn an appropriate transformation using only the training data.…”

Section: Discussion (mentioning)

confidence: 99%
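
The leakage mechanism described in the statement above can be illustrated with a minimal sketch. It is not the cited authors' method: a simple per-feature standardizer stands in for an ordination method such as PCoA, RPCA, or UMAP, and the helper names `fit_standardizer` and `transform`, as well as the toy data, are hypothetical. The point is only that parameters learned from train and test together differ from parameters learned from the training data alone.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Learn per-feature mean and standard deviation from `rows` only."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant features so transform never divides by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, params):
    """Apply previously learned parameters without re-fitting."""
    means, stds = params
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: three training samples and one held-out sample.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[10.0, 100.0]]

# Leak-free: the transformation is learned from the training rows alone,
# then applied unchanged to the held-out row.
params_train_only = fit_standardizer(train)
test_clean = transform(test, params_train_only)

# Leaky: fitting on train + test lets the held-out row shift the parameters.
params_leaky = fit_standardizer(train + test)
test_leaky = transform(test, params_leaky)

print("means (train only):  ", params_train_only[0])
print("means (train + test):", params_leaky[0])
```

In this sketch the held-out sample pulls the learned means away from the training-only values, so any model evaluated on `test_leaky` has indirectly seen the test data.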

Decision Tree Ensembles Utilizing Multivariate Splits Are Effective at Investigating Beta-Diversity in Medically Relevant 16S Amplicon Sequencing Data

Rudar

Golding

Kremer

et al. 2022

Preprint


Canonical distance and dissimilarity measures can fail to capture important relationships in high-throughput sequencing datasets since these measurements are unable to represent feature interactions. By learning a dissimilarity using decision tree ensembles, we can avoid this important pitfall. We used 16S rRNA data from the lumen and mucosa of the distal and proximal human colon and the stool of patients suffering from immune-mediated inflammatory diseases and compared how well the Jaccard and Aitchison metrics preserve the pairwise relationships between samples to dissimilarities learned using Random Forests, Extremely Randomized Trees, and LANDMark. We found that dissimilarities learned by unsupervised LANDMark models were better at capturing differences between communities in each dataset. For example, differences in the microbial communities of the colon's distal lumen and mucosa were better reflected using LANDMark dissimilarity (p ≤ 0.001, R² = 0.476) than using the Jaccard distance (p ≤ 0.001, R² = 0.313) or Random Forest dissimilarity (p ≤ 0.001, R² = 0.237). In addition, applying Uniform Manifold Approximation and Projection to dissimilarity matrices and transforming the result using principal components analysis created two-dimensional projections that captured the main axes of variation while also preserving the pairwise distances between samples (e.g., ρ = 0.8804, p ≤ 0.001 for the distal colon dissimilarities). Finally, supervised LANDMark models tend to outperform both Random Forest and Extremely Randomized Tree classifiers. Models employing multivariate splits can improve the analysis of complex high-throughput sequencing datasets. The improvements observed in this work likely result from the ability of these models to reduce noise from uninformative features. In an unsupervised setting, LANDMark models can preserve pairwise relationships between samples. When used in a supervised manner, these methods tend to learn a decision boundary that is more reflective of the biological variation within the dataset.


“…In such same-organ calibration (SOC) setups, data leakage may take place wherein the calibration model learns and then imparts information from the held-out D_V into the D_T, in other words violating the strict separation between D_T and D_V. Such leakage may subsequently inflate the model testing accuracy, degrading the generalizability of the model (Chiavegatto Filho et al, 2021; Dong, 2022; Kaufman et al, 2012; Tampu et al, 2022). For example, a CycleGAN could impart the task-specific knowledge that BCC cells from an external test site are slightly larger, due to microns per pixel differences in the scanner, into training images by modifying their size.…”

Section: Introduction (mentioning)

confidence: 99%

“…It was also reported (Wei et al, 2019) that a CycleGAN model can easily render visual attributes of precancerous tissue onto normal tissue inputs, wherein the CycleGAN learned and transferred task-specific features from precancerous tissue templates to the training images. Moreover, Dong et al suggest only preprocessing training data to prevent data leakage, therefore calibration of D_T rather than D_V is also in favor of reducing data leakage risk (Dong, 2022). Taken together, it stands to reason that a superior calibration approach could help disentangle and thereby learn site-specific pre-analytic variables, while being blinded from task-specific information, potentially contaminating classifier construction.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-site cross-organ calibrated deep learning (MuSClD): Automated diagnosis of non-melanoma skin cancer

Zhou

Koyuncu

et al. 2023

Medical Image Analysis
