2019
DOI: 10.1002/widm.1330
Method evaluation, parameterization, and result validation in unsupervised data mining: A critical survey

Abstract: Machine Learning (ML) and Data Mining (DM) build tools intended to help users solve data‐related problems that are infeasible for “unaugmented” humans. Tools need manuals, however, and in the case of ML/DM methods, this means guidance with respect to which technique to choose, how to parameterize it, and how to interpret derived results to arrive at knowledge about the phenomena underlying the data. While such information is available in the literature, it has not yet been collected in one place. We survey thr…

Cited by 14 publications (15 citation statements)
References 175 publications (213 reference statements)
“…There is some work on the so-called benchmarking of clustering methods (Van Mechelen et al., 2018; Zimmermann, 2020). This is different from our approach.…”
Section: Introduction
Mentioning confidence: 95%
“…However, when conducting cluster analysis, researchers are confronted with an overwhelming number of existing methods. They must preprocess the data, choose a clustering algorithm, and set parameters, such as the number of clusters (Van Mechelen et al., 2018; Zimmermann, 2020). It is often unclear a priori which choice should be made for the analysis, and even once a choice is made, it may remain unclear how good the quality of the resulting clustering is.…”
Section: Introduction
Mentioning confidence: 99%
“…The phrase “cluster validation” also appears in the literature about benchmarking of clustering methods (Boulesteix & Hatz, 2017; Van Mechelen et al., 2018; Zimmermann, 2020). A benchmarking study is a systematic comparison of different clustering methods on a class of data distributions or datasets.…”
Section: Introduction
Mentioning confidence: 99%
“…Regarding the appropriate design and analysis of benchmark studies, the available literature ranges from general guidelines (Weber et al., 2019; Boulesteix, 2015) and statistical frameworks (Demšar, 2006; Hothorn et al., 2005; Eugster et al., 2012; Boulesteix et al., 2015, all with focus on supervised learning), to recommendations for context-specific benchmarks (e.g. Mangul et al., 2019; Bokulich et al., 2020; Zimmermann, 2020; Kreutz, 2019). However, for many issues relevant in practice (e.g.…”
Section: Introduction and Related Work
Mentioning confidence: 99%