Krzysztof Hajto scite author profile

Clustering is one of the fundamental tools for preliminary analysis of data. While most clustering methods are designed for continuous data, sparse high-dimensional binary representations have become very popular in domains such as text mining and cheminformatics. Applying classical clustering tools to this type of data usually proves very inefficient, both in computational complexity and in the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering sparse high-dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix is specially designed for processing sparse data, can be efficiently realized by an on-line Hartigan optimization algorithm, and describes every cluster by its most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with a higher compatibility with reference grouping than related methods. Moreover, the constructed representatives often better reveal the internal structure of the data.


Split-and-merge Tweak in Cross Entropy Clustering

Hajto, Kamieniecki, Misztal et al. 2017


Part 3: Data Analysis and Information Retrieval. To solve the local convergence problem of the Cross Entropy Clustering algorithm, a split-and-merge operation is introduced to escape from local minima and reach a better solution. We describe the theoretical aspects of the method in a limited space, present a few strategies for tweaking the clustering algorithm, and compare them with existing solutions. The experiments show that the presented approach increases the flexibility and effectiveness of the whole algorithm.


Efficient mixture model for clustering of sparse high dimensional binary data

Śmieja, Hajto, Tabor 2017

Preprint


In this paper we propose a mixture model, SparseMix, for clustering of sparse high-dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix is especially designed for the processing of sparse data, can be efficiently realized by an on-line Hartigan optimization algorithm, and is able to automatically reduce unnecessary clusters. We perform extensive experimental studies on various types of data, which confirm that SparseMix builds partitions with higher compatibility with reference grouping than related methods. Moreover, the constructed representatives often better reveal the internal structure of the data.

