2022
DOI: 10.1155/2022/5314671

Leakage Prediction in Machine Learning Models When Using Data from Sports Wearable Sensors

Abstract: One of the major problems in machine learning is data leakage, which can be directly related to adversarial-type attacks, raising serious concerns about the validity and reliability of artificial intelligence. Data leakage occurs when the independent variables used to teach the machine learning algorithm include either the dependent variable itself or a variable that contains clear information that the model is trying to predict. This data leakage results in unreliable and poor predictive results after the dev…
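The definition in the abstract (an independent variable that is, or closely encodes, the dependent variable) can be illustrated with a simple screening heuristic. The sketch below is not the paper's method; it only flags features whose correlation with the target is suspiciously close to perfect. The feature names, data values, and the 0.98 threshold are hypothetical and chosen purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(vx * vy)

def suspicious_features(features, target, threshold=0.98):
    """Names of features whose |correlation| with the target meets the threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]

# Hypothetical wearable-sensor table: one ordinary feature and one column
# that is an accidental copy of the label (assumed data, for illustration).
target = [0, 1, 0, 1, 1, 0]
features = {
    "heart_rate": [72, 75, 90, 88, 70, 95],
    "injury_flag_copy": [0, 1, 0, 1, 1, 0],
}

flagged = suspicious_features(features, target)
print(flagged)
```

A high correlation is only a hint that a column deserves a closer look; a legitimately strong predictor can trip the same check, and leakage through nonlinear or combined features will not be caught by it.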

Cited by 19 publications (13 citation statements)
References 48 publications
“…To do this, we would have needed to use all the available data (train and test), and this would have resulted in data leakage, since the test data would influence the final projection. Therefore, to prevent data leakage and the reporting of overly optimistic and potentially misleading results, the generalization performance of the RPCA and PCoA ordination methods was excluded (29). An overview of the important properties of each ordination method is presented in Table 1.…”
Section: Results (mentioning)
confidence: 99%
“…To do this, we would have needed to use all the available data (train, test) and this would have resulted in data leakage since the test data would influence the final projection. Therefore, to prevent data leakage and the reporting of overly optimistic and potentially misleading results, the generalization performance of RPCA and PCoA ordination methods were excluded (49). An overview of the important properties of each ordination method is presented in Table 1.…”
Section: Results (mentioning)
confidence: 99%
“…When this happens, information about any withheld data is included when the PCoA or RPCA objective function is optimized. This could bias the results and potentially create machine learning models which produce overly optimistic and potentially misleading results (49). With UMAP this is not a problem since UMAP can learn an appropriate transformation using only the training data.…”
Section: Discussion (mentioning)
confidence: 99%
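The contrast drawn in the statements above, between a projection fitted on all available data and a transformation learned from the training data only, can be sketched with a much simpler stand-in. The example below uses per-feature standardization rather than UMAP, PCoA, or RPCA, and the data values and function names are assumptions made for illustration; only the fit-on-train, apply-to-test pattern is the point.

```python
import statistics

def fit_scaler(values):
    """Learn centering and scaling parameters from the given values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, params):
    """Apply previously learned parameters without refitting."""
    mean, std = params
    return [(v - mean) / std for v in values]

# Hypothetical one-dimensional feature, split into train and held-out test.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leak-free: parameters come from the training split only.
clean_params = fit_scaler(train)
clean_test = transform(test, clean_params)

# Leaky: the held-out values influence the fitted parameters.
leaky_params = fit_scaler(train + test)
leaky_test = transform(test, leaky_params)

print("train-only fit:", clean_params)
print("all-data fit:  ", leaky_params)
```

In the leaky variant the held-out points shift the learned mean and spread, so the transformed test values no longer reflect what a model trained without access to them would see.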
“…In such same-organ calibration (SOC) setups, data leakage may take place wherein the calibration model learns and then imparts information from the held-out D_V into the D_T, in other words violating the strict separation between D_T and D_V. Such leakage may subsequently inflate the model testing accuracy, degrading the generalizability of the model (Chiavegatto Filho et al, 2021; Dong, 2022; Kaufman et al, 2012; Tampu et al, 2022). For example, a CycleGAN could impart the task-specific knowledge that BCC cells from an external test site are slightly larger, due to microns per pixel differences in the scanner, into training images by modifying their size.…”
Section: Introduction (mentioning)
confidence: 99%
“…It was also reported (Wei et al, 2019) that a CycleGAN model can easily render visual attributes of precancerous tissue onto normal tissue inputs, wherein the CycleGAN learned and transferred task-specific features from precancerous tissue templates to the training images. Moreover, Dong et al suggest only preprocessing training data to prevent data leakage, therefore calibration of D_T rather than D_V is also in favor of reducing data leakage risk (Dong, 2022). Taken together, it stands to reason that a superior calibration approach could help disentangle and thereby learn site-specific pre-analytic variables, while being blinded from task-specific information, potentially contaminating classifier construction.…”
Section: Introduction (mentioning)
confidence: 99%