2014
DOI: 10.1111/exsy.12082
Feature selection for clustering categorical data with an embedded modelling approach

Abstract: Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modificati…

Cited by 38 publications (12 citation statements); references 30 publications (42 reference statements).
“…Coding criteria usually are used for comparing two models (like AIC or BIC criteria). Silvestre et al (2015) showed how to apply the MML criterion simultaneously with a clustering method. This is similar to our algorithm, which reduces redundant clusters on-line.…”
Section: Model Selection Criteria
confidence: 99%
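
The distinction drawn in this statement, between coding criteria such as AIC or BIC used to compare already fitted models and the MML criterion applied inside the clustering procedure itself, can be made concrete with a small, hedged sketch of the first case: candidate mixture models are scored by penalizing their maximized log-likelihoods, and the lowest score is kept. The values and names below (aic, bic, candidates) are purely illustrative and are not taken from the cited papers.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: -2*logL + 2*k (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: -2*logL + k*ln(n) (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Hypothetical candidates: (number of clusters, maximized log-likelihood, free parameters).
candidates = [
    (2, -1540.2, 21),
    (3, -1498.7, 32),
    (4, -1490.1, 43),
]
n_samples = 500

# Keep the candidate with the smallest BIC; AIC would be used the same way.
best = min(candidates, key=lambda c: bic(c[1], c[2], n_samples))
print("selected number of clusters:", best[0])
```

The MML approach cited above differs in that the penalty is part of the objective being optimized, so redundant components can be pruned during fitting rather than after it.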
“…On the other hand, the selection of features that minimize redundancy is superior to feature reduction in terms of interpretability (Alelyani Salem et al, 2013) and performance (Ronan et al, 2016). Problems like the one our method is concerned with, binary feature selection for clustering, have rarely been addressed though, while most of the studies have focussed on numerical variables (Silvestre, Cardoso, & Figueiredo, 2015). To our knowledge, only a handful of works did explore clustering in the presence of categorical (thus also binary, in particular) data (Bontemps & Toussile, 2013;Silvestre et al, 2015).…”
Section: On the Strengths and Limitations of the Algorithm
confidence: 99%
“…Problems like the one our method is concerned with, binary feature selection for clustering, have rarely been addressed though, while most of the studies have focussed on numerical variables (Silvestre, Cardoso, & Figueiredo, 2015). To our knowledge, only a handful of works did explore clustering in the presence of categorical (thus also binary, in particular) data (Bontemps & Toussile, 2013;Silvestre et al, 2015). The methods therein developed make certain assumptions on the data and only solve the feature selection problem by simultaneously targeting a distribution in the desired number of clusters, which would not straightforwardly align with the rest of the pipeline in our algorithm.…”
Section: On the Strengths and Limitations of the Algorithm
confidence: 99%
“…In 'Feature selection for clustering categorical data with an embedded modelling approach', Silvestre et al (2014) present a novel approach that simultaneously clusters categorical data and selects relevant features. The approach is based on a Gaussian mixture model, where the minimum message length criterion is used to guide the selection of the relevant features and a modified expectation-maximization algorithm estimates the model parameters.…”
Section: Contents of the Special Issue
confidence: 99%
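
To give a rough sense of the kind of procedure summarized in this editorial statement, a finite mixture fitted by a modified EM algorithm under an MML criterion, the sketch below implements only the plain EM backbone for a mixture of independent Bernoulli features on binary data. It is written under my own simplifying assumptions: the feature-saliency variables and the minimum message length penalty that perform the embedded feature selection in Silvestre et al. are deliberately omitted, and all names (bernoulli_mixture_em, theta, resp) are illustrative rather than taken from the paper.

```python
import numpy as np

def bernoulli_mixture_em(X, n_clusters, n_iter=100, seed=0, eps=1e-9):
    """Plain EM for a mixture of independent Bernoulli features.

    Only the clustering backbone is shown; the embedded approach described
    above additionally attaches a saliency weight to each feature and an
    MML penalty that prunes irrelevant features and empty components.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(n_clusters, 1.0 / n_clusters)         # mixing proportions
    theta = rng.uniform(0.25, 0.75, size=(n_clusters, d))   # Bernoulli parameters

    for _ in range(n_iter):
        # E-step: responsibilities, computed via logs for numerical stability.
        log_resp = (X @ np.log(theta + eps).T
                    + (1.0 - X) @ np.log(1.0 - theta + eps).T
                    + np.log(weights + eps))
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate mixing proportions and Bernoulli parameters.
        nk = resp.sum(axis=0) + eps
        weights = nk / n
        theta = (resp.T @ X) / nk[:, None]

    return weights, theta, resp

# Tiny synthetic check: two obvious groups of binary profiles.
X = np.array([[1, 1, 0, 0]] * 10 + [[0, 0, 1, 1]] * 10, dtype=float)
w, t, r = bernoulli_mixture_em(X, n_clusters=2)
print(np.round(w, 2))  # estimated mixing proportions
print(np.round(t, 2))  # per-cluster Bernoulli parameters
```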