2005
DOI: 10.1198/106186005x59630

The Design and Analysis of Benchmark Experiments

Abstract: The assessment of the performance of learners by means of benchmark experiments is an established exercise. In practice, benchmark studies are a tool to compare the performance of several competing algorithms for a certain learning problem. Cross-validation or resampling techniques are commonly used to derive point estimates of the performances, which are compared to identify algorithms with good properties. For several benchmarking problems, test procedures taking the variability of those point estimates into …
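To make the setup concrete, here is a minimal sketch of such a benchmark experiment. It is not the authors' implementation: scikit-learn learners, a synthetic data set, out-of-bag misclassification error, and a paired t-test as the test procedure are all assumptions made for illustration.

```python
# Sketch of a benchmark experiment: draw B bootstrap learning samples,
# fit two candidate learners on each, estimate out-of-bag misclassification
# error, and test whether the performance distributions differ.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)  # stand-in data set
learners = {"logreg": LogisticRegression(max_iter=1000),
            "tree": DecisionTreeClassifier(random_state=0)}

B = 100                                        # number of learning samples
n = len(y)
errors = {name: [] for name in learners}
for b in range(B):
    idx = rng.integers(0, n, n)                # bootstrap learning sample
    oob = np.setdiff1d(np.arange(n), idx)      # out-of-bag test observations
    for name, clf in learners.items():
        clf.fit(X[idx], y[idx])
        errors[name].append(np.mean(clf.predict(X[oob]) != y[oob]))

diff = np.array(errors["logreg"]) - np.array(errors["tree"])
t, p = stats.ttest_1samp(diff, 0.0)            # H0: equal mean performance
print(f"mean difference {diff.mean():.4f}, t = {t:.2f}, p = {p:.4f}")
```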

Cited by 167 publications (143 citation statements: 2 supporting, 141 mentioning, 0 contrasting), with citing publications from 2007 to 2023.
References 43 publications.

Citation statements (ordered by relevance):
“…In either case, it is important to note that, because the number of bootstrap or subsamples drawn from given data sets, and the number of samples drawn from a data-generating process in a simulation study, is arbitrary, one can detect very small performance differences with very high power when the number of learning samples B is large (see also [11]). …”
Section: Discussion (mentioning, confidence: 99%)
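This power argument can be illustrated numerically. The sketch below is purely illustrative (a hypothetical tiny true difference with normal noise, not data from the cited papers): because the standard error of the mean difference shrinks like 1/sqrt(B), the test eventually rejects for any fixed nonzero difference as B grows.

```python
# Illustration: for a fixed tiny true performance difference, the p-value of a
# paired test shrinks as the number of learning samples B grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_diff, sd = 0.002, 0.05       # hypothetical small difference in error rates
for B in (25, 100, 400, 1600, 6400):
    d = rng.normal(true_diff, sd, size=B)   # simulated per-sample differences
    p = stats.ttest_1samp(d, 0.0).pvalue
    print(f"B = {B:5d}: p = {p:.4f}")
```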
“…In the following we extend the benchmarking framework by Hothorn et al. (2005) for regression and classification problems ("supervised learning") to the case of cluster analysis ("unsupervised learning"). Let X_N = {x_1, …”
Section: Bootstrapping Segmentation Algorithms (mentioning, confidence: 99%)
“…, s_B}, independence is of great importance, see Hothorn et al. (2005). Many questions about stability of the segmentation algorithm can now be formulated in terms of standard statistical inference on S, as demonstrated below on several examples.…”
Section: Evaluating Reproducibility (mentioning, confidence: 99%)
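One way to instantiate this idea is sketched below, under the assumption that scikit-learn's KMeans is the segmentation algorithm and the adjusted Rand index is the agreement statistic (the cited work may use different choices): cluster pairs of independent bootstrap samples, compare the partitions they induce on the original observations, and collect the agreement values s_1, …, s_B for subsequent inference.

```python
# Sketch: bootstrap a segmentation algorithm and collect agreement statistics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
X, _ = make_blobs(n_samples=400, centers=4, random_state=2)  # stand-in data
n, B, agreements = len(X), 50, []
for b in range(B):
    km1 = KMeans(n_clusters=4, n_init=10, random_state=b).fit(X[rng.integers(0, n, n)])
    km2 = KMeans(n_clusters=4, n_init=10, random_state=b + B).fit(X[rng.integers(0, n, n)])
    # Agreement of the two induced partitions of the original observations
    agreements.append(adjusted_rand_score(km1.predict(X), km2.predict(X)))

s = np.array(agreements)   # s_1, ..., s_B: input for standard inference on S
print(f"mean agreement {s.mean():.3f}, 5% quantile {np.quantile(s, 0.05):.3f}")
```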
“…In any case, we feel that it is both more natural and preferable to derive rankings based on the comparisons of performances only, in particular basing these on a notion of one algorithm A_i performing significantly better than another algorithm A_j, symbolically A_i > A_j. Using the experimental designs of Hothorn et al. (2005), "classical" hypothesis tests can be employed for assessing significant deviations in performance.…”
Section: Consensus Rankings (mentioning, confidence: 99%)
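A possible instantiation of such a ranking is sketched below, using one-sided paired t-tests and a simple count-of-wins aggregation; both choices are assumptions for illustration, not the cited procedure. The relation A_i > A_j is declared whenever algorithm i is significantly better (lower error) than algorithm j on a common set of learning samples.

```python
# Sketch: derive a ranking from pairwise "significantly better" relations.
import numpy as np
from itertools import permutations
from scipy import stats

rng = np.random.default_rng(3)
names = ["A1", "A2", "A3"]
# perf[:, k]: error of algorithm k on B = 100 common learning samples (toy data)
perf = rng.normal([0.10, 0.12, 0.12], 0.03, size=(100, 3))

wins = {name: 0 for name in names}
for i, j in permutations(range(len(names)), 2):
    # One-sided paired test: is algorithm i better (lower error) than j?
    p = stats.ttest_rel(perf[:, i], perf[:, j], alternative="less").pvalue
    if p < 0.05:
        wins[names[i]] += 1

ranking = sorted(names, key=lambda nm: -wins[nm])
print("wins:", wins, "-> ranking:", ranking)
```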
“…Often, p-values reported for assessing significant difference in the performance of algorithms are rather incorrect (e.g., necessary independence assumptions cannot be guaranteed in commonly employed experimental designs) or potentially misleading (e.g., by solely focusing on the means of performance distributions, which can be considerably skewed). Hothorn, Leisch, Zeileis, and Hornik (2005) provide a framework which allows the comparison of algorithms on single data sets based on classical statistical inference procedures, making it possible to test one-sided hypotheses ("Does algorithm A_i perform significantly better than algorithm A_j on data set D_b?") as well as the hypothesis of non-equivalence.…”
Section: Introduction (mentioning, confidence: 99%)
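For skewed performance distributions, a rank-based one-sided test is one option for the question "Does algorithm A_i perform significantly better than A_j on data set D_b?". The sketch below uses the Wilcoxon signed-rank test on paired error estimates; the test choice and the simulated data are illustrative assumptions, and the framework admits other inference procedures.

```python
# Sketch: one-sided comparison of two algorithms on a single data set,
# using a rank-based test that does not rely on the mean of a skewed
# performance distribution alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
err_i = rng.gamma(shape=2.0, scale=0.05, size=100)          # skewed error estimates for A_i
err_j = err_i + rng.gamma(shape=2.0, scale=0.01, size=100)  # A_j slightly worse, paired with A_i

res = stats.wilcoxon(err_i, err_j, alternative="less")      # H1: A_i has lower error than A_j
print(f"Wilcoxon statistic {res.statistic:.1f}, one-sided p = {res.pvalue:.4f}")
```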