Stability-Based Validation of Clustering Solutions

Lange, Tilman; Roth, Volker; Braun, Mikio L.; Buhmann, Joachim M.

doi:10.1162/089976604773717621

Cited by 418 publications

(342 citation statements)

References 17 publications

Mentioning: 320


“…Examples are Ben-Hur et al (2002), Bryan (2004), Dudoit and Fridlyand (2002), Grün and Leisch (2004), Lange et al (2004), Monti et al (2001) and Tibshirani and Walther (2005). Many of these papers use stability or prediction strength measurements as a tool to estimate the true number of clusters.…”

Section: Introduction (mentioning)

confidence: 99%

Cluster-wise assessment of cluster stability

Hennig

2007

Computational Statistics & Data Analysis

596

502


Stability in cluster analysis is strongly dependent on the data set, especially on how well separated and how homogeneous the clusters are. In the same clustering, some clusters may be very stable and others may be extremely unstable. The Jaccard coefficient, a similarity measure between sets, is used as a cluster-wise measure of cluster stability, which is assessed by the bootstrap distribution of the Jaccard coefficient for every single cluster of a clustering compared to the most similar cluster in the bootstrapped data sets. This can be applied to very general cluster analysis methods. Some alternative resampling methods are investigated as well, namely subsetting, jittering the data points and replacing some data points by artificial noise points. The different methods are compared by means of a simulation study. A data example illustrates the use of the cluster-wise stability assessment to distinguish between meaningful stable and spurious clusters, but it is also shown that clusters are sometimes only stable because of the inflexibility of certain clustering methods.
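The bootstrap procedure summarized in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hennig's implementation: `cluster_fn` is a hypothetical interface for any clustering method, `largest_gap_split` is a toy 1-D clusterer invented for the demo, and the mean Jaccard value is only one possible summary of the bootstrap distribution.

```python
import random

def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def clusterwise_stability(points, cluster_fn, n_boot=50, seed=0):
    """Mean bootstrap Jaccard similarity for each cluster of cluster_fn(points).

    cluster_fn (hypothetical interface): takes a list of points and returns
    a list of clusters, each a list of indices into that list.
    """
    rng = random.Random(seed)
    n = len(points)
    original = [set(c) for c in cluster_fn(points)]
    totals = [0.0] * len(original)
    for _ in range(n_boot):
        # Bootstrap sample: n indices drawn with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        drawn = set(sample)
        boot = cluster_fn([points[i] for i in sample])
        # Translate bootstrap clusters back to original point indices.
        boot_sets = [{sample[j] for j in c} for c in boot]
        for k, cluster in enumerate(original):
            # Compare only the part of the cluster that was actually drawn
            # with the most similar bootstrap cluster.
            restricted = cluster & drawn
            totals[k] += max((jaccard(restricted, b) for b in boot_sets),
                             default=0.0)
    return [t / n_boot for t in totals]

def largest_gap_split(xs):
    """Toy 1-D clusterer (assumption for the demo): cut at the widest gap."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    gaps = [xs[order[i + 1]] - xs[order[i]] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [order[:cut], order[cut:]]

# Two well-separated groups of ten points each (made-up data).
data = [0.1 * i for i in range(10)] + [10.0 + 0.1 * i for i in range(10)]
scores = clusterwise_stability(data, largest_gap_split)
print(scores)
```

Values near 1 indicate that a cluster is recovered in almost every bootstrap sample, while low values suggest a cluster that may be spurious.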


“…Existing algorithms include stability-based methods [5,6], model-fitting-based algorithms [7], and methods based on Clustering Validity Indices (CVI) [1]. A CVI is a measure derived from the obtained clustering solution, which quantifies such properties of a clustering solution as compactness, separation between clusters, etc.…”

Section: Introduction (mentioning)

confidence: 99%

Estimation of the Number of Clusters Using Multiple Clustering Validity Indices

Kryszczuk

Hurley

2010

Lecture Notes in Computer Science


Abstract. One of the challenges in unsupervised machine learning is finding the number of clusters in a dataset. Clustering Validity Indices (CVI) are popular tools used to address this problem. A large number of CVIs have been proposed, and reports that compare different CVIs suggest that no single CVI can always outperform others. Following suggestions found in prior art, in this paper we formalize the concept of using multiple CVIs for cluster number estimation in the framework of multi-classifier fusion. Using a large number of datasets, we show that decision-level fusion of multiple CVIs can lead to significant gains in accuracy in estimating the number of clusters, in particular for high-dimensional datasets with a large number of clusters.
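As a rough illustration of decision-level fusion, the Python sketch below combines the cluster-number estimates of several CVIs by plain majority vote. This is an assumption made for illustration: the abstract does not state which fusion rules the paper uses, and the function name, the tie-breaking rule (smallest k wins) and the example estimates are all hypothetical.

```python
from collections import Counter

def fuse_by_majority(estimates):
    """Fuse per-index estimates of the number of clusters by majority vote.

    estimates: dict mapping a CVI name to the k that index selected.
    Ties are broken in favor of the smallest k (an arbitrary choice).
    """
    counts = Counter(estimates.values())
    top = max(counts.values())
    return min(k for k, c in counts.items() if c == top)

# Made-up estimates from three commonly used indices.
votes = {"silhouette": 3, "davies_bouldin": 3, "calinski_harabasz": 4}
fused_k = fuse_by_majority(votes)
print(fused_k)
```

Weighted voting or a trained combiner would fit the same interface by replacing the counting step.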


“…We run the sampler with a number of clusters varying from 1 to 10 each for 10 different random initializations. We compare the transfer costs with the instability measure proposed in [15]. The results are summarized in Figure 5.…”

Section: Minimum Transfer Costs For Non-factorial Models (mentioning)

confidence: 99%

The Minimum Transfer Cost Principle for Model-Order Selection

Frank

Chehreghani

Buhmann

2011

Machine Learning and Knowledge Discovery in Databases

Self Cite


Abstract. The goal of model-order selection is to select a model variant that generalizes best from training data to unseen test data. In unsupervised learning without any labels, the computation of the generalization error of a solution poses a conceptual problem which we address in this paper. We formulate the principle of "minimum transfer costs" for model-order selection. This principle renders the concept of cross-validation applicable to unsupervised learning problems. As a substitute for labels, we introduce a mapping between objects of the training set to objects of the test set enabling the transfer of training solutions. Our method is explained and investigated by applying it to well-known problems such as singular-value decomposition, correlation clustering, Gaussian mixture models, and k-means clustering. Our principle finds the optimal model complexity in controlled experiments and in real-world problems such as image denoising, role mining and detection of misconfigurations in access-control data.
