2019
DOI: 10.1007/978-3-030-33607-3_12
The Prevalence of Errors in Machine Learning Experiments

Abstract: Context: Conducting experiments is central to machine learning research, in order to benchmark, evaluate and compare learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments. Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is simple arithmetical and statistical errors. Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken …

Cited by 10 publications (13 citation statements)
References 17 publications
“…Aggregate the results to provide a summary of the performance so that one can determine if the method has, on average, better performance for a specific single-mode or multi-mode problem. Use formal methods to compare the result (Shepperd et al, 2019) (for example the Bonferroni correction, Benjamini-Hochberg false discovery rate estimate and Nemenyi post hoc procedure).…”
Section: Dealing With Different Experimental Set-ups
Citation type: mentioning (confidence: 99%)
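The statement above recommends formal multiple-comparison procedures such as the Benjamini-Hochberg false discovery rate estimate. As a minimal sketch of that procedure (the function name `benjamini_hochberg` is illustrative, not from any cited paper):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg false discovery rate procedure.

    Returns a list of booleans (in the input order) marking which
    hypotheses are rejected while controlling FDR at level alpha.
    """
    m = len(p_values)
    # Sort indices by ascending p-value, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:
            k_max = rank
    # Reject every hypothesis whose p-value ranks at or below k_max.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject

pvals = [0.01, 0.20, 0.03, 0.04, 0.02]
benjamini_hochberg(pvals)  # rejects every hypothesis except the 0.20 entry
```

Unlike the simpler Bonferroni correction (reject when p <= alpha / m), this keeps more power when many of the compared learners genuinely differ.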
“…To avoid leakage, test data set instances must not, under any circumstance, be used for training (see sections 8.3 and 6.2). If many models are generated during hyper-parameter tweaking, then an appropriate method must be used to select the best model (Shepperd et al, 2019) (section 7.3), or alternately, report on the robustness of the models (using for example sensitivity analysis (Cortez & Embrechts, 2013) or at the very least simple aggregates such as median, inter-quartile range, minimum and maximum (Giles & Lawrence, 1997)). This avoids a form of data snooping where the selected model has a particularly high performance that is due to chance alone (6.3).…”
Section: Model Fitting
Citation type: mentioning (confidence: 99%)
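The robustness aggregates named in the statement above (median, inter-quartile range, minimum, maximum) can be computed over repeated model runs as follows; `performance_summary` is an illustrative helper, not an API from the cited works:

```python
import statistics

def performance_summary(scores):
    """Simple robustness aggregates over repeated model runs:
    median, inter-quartile range, minimum and maximum."""
    # quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3].
    q = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "median": statistics.median(scores),
        "iqr": q[2] - q[0],  # Q3 - Q1
        "min": min(scores),
        "max": max(scores),
    }

# e.g. accuracy scores from five cross-validation repeats
performance_summary([1, 2, 3, 4, 5])
```

Reporting these aggregates, rather than only the single best run, makes it harder for a chance-inflated result to be mistaken for genuine performance.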