2019
DOI: 10.1007/978-3-030-33607-3_12
The Prevalence of Errors in Machine Learning Experiments

Abstract: Context: Conducting experiments is central to machine learning research, in order to benchmark, evaluate and compare learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments. Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is simple arithmetical and statistical errors. Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken …

Cited by 10 publications (13 citation statements)
References 17 publications
“…Aggregate the results to provide a summary of the performance so that one can determine if the method has, on average, better performance for a specific single-mode or multi-mode problem. Use formal methods to compare the result (Shepperd et al, 2019) (for example the Bonferroni correction, Benjamini-Hochberg false discovery rate estimate and Nemenyi post hoc procedure).…”
Section: Dealing With Different Experimental Set-ups
Citation type: mentioning (confidence: 99%)
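The statement above recommends formal multiple-comparison procedures such as the Benjamini-Hochberg false discovery rate estimate. As a minimal sketch of that procedure (the function name `benjamini_hochberg` is illustrative, not from any cited paper):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg false discovery rate procedure.

    Returns a list of booleans (in the input order) marking which
    hypotheses are rejected while controlling FDR at level alpha.
    """
    m = len(p_values)
    # Sort indices by ascending p-value, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:
            k_max = rank
    # Reject every hypothesis whose p-value ranks at or below k_max.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject

pvals = [0.01, 0.20, 0.03, 0.04, 0.02]
benjamini_hochberg(pvals)  # rejects every hypothesis except the 0.20 entry
```

Unlike the simpler Bonferroni correction (reject when p <= alpha / m), this keeps more power when many of the compared learners genuinely differ.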
“…To avoid leakage, test data set instances must not, under any circumstance, be used for training (see sections 8.3 and 6.2). If many models are generated during hyper-parameter tweaking, then an appropriate method must be used to select the best model (Shepperd et al, 2019) (section 7.3), or alternately, report on the robustness of the models (using for example sensitivity analysis (Cortez & Embrechts, 2013) or at the very least simple aggregates such as median, inter-quartile range, minimum and maximum (Giles & Lawrence, 1997)). This avoids a form of data snooping where the selected model has a particularly high performance that is due to chance alone (6.3).…”
Section: Model Fitting
Citation type: mentioning (confidence: 99%)
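The robustness aggregates named in the statement above (median, inter-quartile range, minimum, maximum) can be computed over repeated model runs as follows; `performance_summary` is an illustrative helper, not an API from the cited works:

```python
import statistics

def performance_summary(scores):
    """Simple robustness aggregates over repeated model runs:
    median, inter-quartile range, minimum and maximum."""
    # quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3].
    q = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "median": statistics.median(scores),
        "iqr": q[2] - q[0],  # Q3 - Q1
        "min": min(scores),
        "max": max(scores),
    }

# e.g. accuracy scores from five cross-validation repeats
performance_summary([1, 2, 3, 4, 5])
```

Reporting these aggregates, rather than only the single best run, makes it harder for a chance-inflated result to be mistaken for genuine performance.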