Many clustering methods, such as K-means, kernel K-means, and MNCut clustering, follow the same recipe: (i) choose a measure of similarity between observations; (ii) define a figure of merit assigning a large value to partitions of the data that put similar observations in the same cluster; and (iii) optimize this figure of merit over partitions. Potts model clustering represents an interesting variation on this recipe. Blatt, Wiseman, and Domany defined a new figure of merit for partitions that is formally similar to the Hamiltonian of the Potts model for ferromagnetism, extensively studied in statistical physics. For each temperature T, the Hamiltonian defines a distribution assigning a probability to each possible configuration of the physical system or, in the language of clustering, to each partition. Instead of searching for a single partition optimizing the Hamiltonian, they sampled a large number of partitions from this distribution over a range of temperatures. They proposed a heuristic for choosing an appropriate temperature, and from the sample of partitions associated with the chosen temperature they derived what we call a consensus clustering: two observations are put in the same consensus cluster if they belong to the same cluster in the majority of the random partitions. In a sense, the consensus clustering is an "average" of plausible configurations, and we would expect it to be more stable (over different samples) than the configuration optimizing the Hamiltonian.

The goal of this article is to contribute to the understanding of Potts model clustering and to propose extensions and improvements: (1) we show that the Hamiltonian used in Potts model clustering is closely related to the kernel K-means and MNCut criteria; (2) we propose a modification of the Hamiltonian that penalizes unequal cluster sizes and show that it can be interpreted as a weighted version of the kernel K-means criterion; (3) we introduce a new version of the Wolff algorithm to simulate configurations from the distribution defined by the penalized Hamiltonian, leading to penalized Potts model clustering; (4) we note a link between kernel-based clustering methods and nonparametric density estimation and exploit it to automatically determine locally adaptive kernel bandwidths; and (5) we propose a simple new rule for selecting a good temperature T.

As an illustration, we apply Potts model clustering to gene expression data and compare our results to those obtained by model-based clustering and a nonparametric dendrogram-sharpening method.
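The two ingredients of this recipe are easy to state concretely. Below is a minimal Python sketch, not the authors' implementation: a Potts-style figure of merit that charges a penalty k(x_i, x_j) whenever similar observations land in different clusters, and a consensus clustering that links two observations when they co-occur in a majority of sampled partitions. The Gaussian kernel and the 0.5 majority threshold are illustrative assumptions.

```python
# Hedged sketch of the Potts figure of merit and consensus clustering.
import numpy as np

def gaussian_kernel(X, bandwidth=1.0):
    """Pairwise similarities k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def potts_hamiltonian(labels, K):
    """H(sigma) = sum over pairs of k(x_i, x_j) * 1{sigma_i != sigma_j}:
    partitions that separate similar points pay a large penalty."""
    diff = labels[:, None] != labels[None, :]
    return K[diff].sum() / 2.0          # each unordered pair counted once

def consensus_clustering(sampled_labelings, threshold=0.5):
    """Link i and j when they share a cluster in a majority of the sampled
    partitions, then read off connected components of that graph."""
    S = np.asarray(sampled_labelings)   # shape (n_partitions, n_points)
    co = (S[:, :, None] == S[:, None, :]).mean(axis=0)
    adj = co >= threshold
    n = adj.shape[0]
    cluster = np.full(n, -1)
    c = 0
    for seed in range(n):
        if cluster[seed] < 0:           # start a new component
            stack = [seed]
            while stack:
                u = stack.pop()
                if cluster[u] < 0:
                    cluster[u] = c
                    stack.extend(np.flatnonzero(adj[u] & (cluster < 0)))
            c += 1
    return cluster
```

With the labelings drawn at the chosen temperature (for example by cluster-flipping moves of the Wolff or Swendsen-Wang type), `consensus_clustering` averages the sample in exactly the majority-vote sense described above.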
This article presents a Bayesian kernel-based clustering method. The associated model arises as an embedding of the Potts density for class membership probabilities into an extended Bayesian model for joint data and class membership probabilities. The method may be seen as a principled extension of super-paramagnetic clustering. The model depends on two parameters: the temperature and the kernel bandwidth. The clustering is obtained from the posterior marginal adjacency membership probabilities and does not depend on any particular values of the parameters. We elicit an informative prior based on random graph theory and kernel density estimation. A stochastic population Monte Carlo algorithm, based on parallel runs of the Wang-Landau algorithm, is developed to estimate the posterior adjacency membership probabilities and the parameter posterior; the convergence of the algorithm is also established. The method is applied to the whole human proteome to uncover human genes that share a common evolutionary history. Our experiments and application show that good clustering results are obtained at many different values of the temperature and bandwidth parameters. Hence, instead of focusing on finding adequate parameter values, we advocate basing clustering inference on the study of the distribution of the posterior adjacency membership probabilities. Supplementary material for this article is available online.
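The inferential target here can be sketched concisely. The following hedged Python fragment estimates the posterior adjacency (co-membership) probabilities as a weighted average over sampled labelings, marginalizing over the temperature and bandwidth draws; the importance weights `w` stand in for whatever the population Monte Carlo scheme supplies and are an assumption, not the paper's exact estimator.

```python
# Hedged sketch: posterior adjacency probabilities from weighted MCMC draws.
import numpy as np

def posterior_adjacency(labelings, weights=None):
    """p_ij ~= sum_t w_t * 1{sigma_i^(t) == sigma_j^(t)} / sum_t w_t."""
    S = np.asarray(labelings)                    # (n_draws, n_points)
    w = np.ones(len(S)) if weights is None else np.asarray(weights, float)
    co = (S[:, :, None] == S[:, None, :]).astype(float)
    return np.tensordot(w, co, axes=1) / w.sum()
```

A clustering can then be read off the resulting matrix at any desired co-membership cutoff, which is what makes the inference independent of any single (temperature, bandwidth) pair.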
We describe a probabilistic approach to simultaneous image segmentation and intensity estimation for complementary DNA microarray experiments. The approach overcomes several limitations of existing methods. In particular, it (a) uses a flexible Markov random field approach to segmentation that allows for a wider range of spot shapes than existing methods, including relatively common 'doughnut-shaped' spots; (b) models the image directly as background plus hybridization intensity, and estimates the two quantities simultaneously, avoiding the common logical error that estimates of foreground may be less than those of the corresponding background if the two are estimated separately; and (c) uses a probabilistic modeling approach to simultaneously perform segmentation and intensity estimation, and to compute spot quality measures. We describe two approaches to parameter estimation: a fast algorithm, based on the expectation-maximization and the iterated conditional modes algorithms, and a fully Bayesian framework. These approaches produce comparable results, and both appear to offer some advantages over other methods. We use an HIV experiment to compare our approach to two commercial software products: Spot and Arrayvision.
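To make the segmentation step concrete, here is a minimal sketch of iterated conditional modes (ICM) for a two-class Markov random field over a spot image; the Gaussian data term and the smoothing weight `beta` are illustrative assumptions, not the paper's exact model.

```python
# Hedged ICM sketch for two-class MRF segmentation of a spot image:
# label 0 = background, label 1 = spot.
import numpy as np

def icm_segment(img, mu, sigma, beta=1.0, n_sweeps=10):
    """Each sweep sets every pixel to the label minimizing a local energy:
    Gaussian negative log-likelihood + beta * (# disagreeing 4-neighbours)."""
    labels = (np.abs(img - mu[1]) < np.abs(img - mu[0])).astype(int)
    H, W = img.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                nbrs = [labels[a, b]
                        for a, b in ((i - 1, j), (i + 1, j),
                                     (i, j - 1), (i, j + 1))
                        if 0 <= a < H and 0 <= b < W]
                energies = [0.5 * ((img[i, j] - mu[k]) / sigma[k]) ** 2
                            + np.log(sigma[k])
                            + beta * sum(n != k for n in nbrs)
                            for k in (0, 1)]
                labels[i, j] = int(np.argmin(energies))
    return labels
```

In an EM-style loop, `mu` and `sigma` would be re-estimated from the current labels between sweeps, in the spirit of the fast algorithm described above; the fully Bayesian variant would replace these point updates with posterior sampling.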