Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract "natural" group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.
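The stability idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' procedure: it uses a plain k-means as the clustering algorithm and a nearest-centroid rule as the classifier, and the names `kmeans`, `nearest`, `disagreement` and `instability` are hypothetical. The abstract does not say how risks are made comparable across different numbers of clusters, so the sketch only reports the raw disagreement rate for each k.

```python
import itertools
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means on a list of coordinate tuples; returns the centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids


def disagreement(labels_a, labels_b, k):
    """Fraction of mismatched labels, minimised over all renamings of the k labels."""
    n = len(labels_a)
    return min(sum(perm[a] != b for a, b in zip(labels_a, labels_b))
               for perm in itertools.permutations(range(k))) / n


def instability(points, k, rng, splits=20):
    """Average disagreement between transferred labels and a fresh clustering."""
    total = 0.0
    for _ in range(splits):
        shuffled = points[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        c_first = kmeans(first, k, rng)    # cluster the first sample
        c_second = kmeans(second, k, rng)  # cluster the second sample
        # Classifier trained on the first sample predicts labels for the second.
        predicted = [nearest(p, c_first) for p in second]
        clustered = [nearest(p, c_second) for p in second]
        total += disagreement(predicted, clustered, k)
    return total / splits


if __name__ == "__main__":
    rng = random.Random(0)
    # Two well-separated one-dimensional groups (synthetic, for illustration only).
    data = [(i * 0.1,) for i in range(30)] + [(100 + i * 0.1,) for i in range(30)]
    for k in (2, 3, 4):
        print(k, instability(data, k, rng))
```

The brute-force search over label permutations is only practical for small k; a larger k would call for an assignment solver instead.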
Recognition of urban structures is of interest in cartography and urban modelling. While a broad range of typologies of urban patterns have been published in the last century, relatively little research on the automated recognition of such structures exists. This work presents a sample-based approach for the recognition of five types of urban structures: (1) inner city areas, (2) industrial and commercial areas, (3) urban areas, (4) suburban areas and (5) rural areas. The classification approach is based only on the characterisation of building geometries with morphological measures derived from perceptual principles of Gestalt psychology. Thereby, size, shape and density of buildings are evaluated. After defining the research questions we develop the classification methodology and evaluate the approach with respect to several aspects. The experiments focus on the impact of different classification algorithms, correlations and contributions of measures, parameterisation of buffer-based indices, and mode filtering. In addition to that, we investigate the influence of scale and regional factors. The results show that the chosen approach is generally successful. It turns out that scale, algorithm parameterisation, and regional heterogeneity of building structures substantially influence the classification performance.
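To make the size, shape and density characterisation concrete, the sketch below computes three simple measures from building footprints given as vertex lists. The specific measures (shoelace area, perimeter, isoperimetric compactness, and a covered-area ratio) and all function names are assumptions chosen for illustration; the abstract does not list the measures actually used, nor how its buffer-based indices are defined.

```python
import math


def _edges(vertices):
    """Consecutive vertex pairs of a closed polygon, including the closing edge."""
    return zip(vertices, vertices[1:] + vertices[:1])


def area(vertices):
    """Polygon area by the shoelace formula (size measure)."""
    twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in _edges(vertices))
    return abs(twice) / 2


def perimeter(vertices):
    """Sum of edge lengths."""
    return sum(math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in _edges(vertices))


def compactness(vertices):
    """Isoperimetric quotient 4*pi*A / P^2: 1 for a circle, lower for elongated shapes."""
    return 4 * math.pi * area(vertices) / perimeter(vertices) ** 2


def building_density(footprints, region_area):
    """Share of a region covered by buildings (hypothetical density measure)."""
    return sum(area(f) for f in footprints) / region_area


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(area(square), perimeter(square), compactness(square))
```

Feature vectors of this kind, one per building or per neighbourhood, could then be passed to any of the classification algorithms the study compares.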
Model order selection and cue combination are both difficult open problems in the area of clustering. In this work we build upon stability-based approaches to develop a new method for automatic model order selection and cue combination with applications to visual grouping. Novel features of our approach include the ability to detect multiple stable clusterings (instead of only one), a simpler means of calculating stability that does not require training a classifier, and a new characterization of the space of stabilities for a continuum of segmentations that provides for an efficient sampling scheme. Our contribution is a framework for visual grouping that frees the user from the hassles of parameter tuning and model order selection: the input is an image, the output is a shortlist of segmentations.
A novel approach to class discovery in gene expression datasets is presented. In the context of clinical diagnosis, the central goal of class discovery algorithms is to simultaneously find putative (sub-)types of diseases and to identify informative subsets of genes with disease-type-specific expression profiles. Contrary to many other approaches in the literature, the method presented implements a wrapper strategy for feature selection, in the sense that the features are selected directly by optimizing the discriminative power of the partitioning algorithm used. The usual combinatorial problems associated with wrapper approaches are overcome by a Bayesian inference mechanism. On the technical side, we present an efficient optimization algorithm with a guaranteed local convergence property. The only free parameter of the optimization method is selected by a resampling-based stability analysis. Experiments with Leukemia and Lymphoma datasets demonstrate that our method is able to correctly infer partitions and corresponding subsets of genes that are both biologically relevant. Moreover, the frequently observed problem of ambiguities caused by different but equally high-scoring partitions is successfully overcome by the proposed model selection method.
Data clustering represents an important tool in exploratory data analysis. The lack of objective criteria renders model selection as well as the identification of robust solutions particularly difficult. The use of a stability assessment and the combination of multiple clustering solutions are important ingredients in finding useful partitions. In this work, we propose a novel way of combining multiple clustering solutions for both hard and soft partitions: the approach is based on modeling the probability that two objects are grouped together. An efficient EM optimization strategy is employed to estimate the model parameters. Our proposal can also be extended to emphasize the signal more strongly by weighting individual base clustering solutions according to their consistency with the prediction for previously unseen objects. In addition, the probabilistic model supports an out-of-sample extension that (i) makes it possible to assign previously unseen objects to classes of the combined solution and (ii) renders the efficient aggregation of solutions possible. In this work, we also shed some light on the usefulness of such combination approaches. In the experimental results section, we demonstrate the competitive performance of our proposal in comparison with other recently proposed methods for combining multiple classifications of a finite data set.
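The quantity at the centre of this abstract, the probability that two objects are grouped together, can be illustrated with a much simpler stand-in than the EM-fitted model the authors describe. The sketch below estimates that probability as the empirical fraction of base clusterings in which a pair shares a label, then links pairs above a threshold. The function names and the 0.5 threshold are assumptions for illustration only; this is not the paper's probabilistic model.

```python
def coassociation(labelings):
    """Matrix whose (i, j) entry is the fraction of base clusterings
    that put objects i and j in the same cluster."""
    n = len(labelings[0])
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
            for i in range(n)]


def consensus(co, threshold=0.5):
    """Combined partition: link pairs whose co-association exceeds `threshold`
    and return the connected components as cluster labels (union-find)."""
    n = len(co)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(i)] = find(j)

    roots = {}
    # Number the components in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


if __name__ == "__main__":
    # Three base clusterings of four objects; label names differ between runs.
    base = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
    print(consensus(coassociation(base)))
```

Because only pairwise agreement is counted, the base clusterings do not need matching label names or even the same number of clusters.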