Murilo Coelho Naldi scite author profile

One of the most influential algorithms in data mining, k-means, is broadly used in practical tasks for its simplicity, computational efficiency and effectiveness in high dimensional problems. However, k-means has two major drawbacks, which are the need to choose the number of clusters, k, and the sensibility to the initial prototypes' position. In this work, systematic, evolutionary and order heuristics used to suppress these drawbacks are compared. 27 variants of 4 algorithmic approaches are used to partition 324 synthetic data sets and the obtained results are compared.

show abstract

Hierarchical Density-Based Clustering Using MapReduce

Santos

Syed

Naldi

et al. 2021

IEEE Trans. Big Data

View full text Add to dashboard Cite

Hierarchical density-based clustering is a powerful tool for exploratory data analysis, which can play an important role in the understanding and organization of datasets. However, its applicability to large datasets is limited because the computational complexity of hierarchical clustering methods has a quadratic lower bound in the number of objects to be clustered. MapReduce is a popular programming model to speed up data mining and machine learning algorithms operating on large, possibly distributed datasets. In the literature, there have been attempts to parallelize algorithms such as Single-Linkage, which in principle can also be extended to the broader scope of hierarchical density-based clustering, but hierarchical clustering algorithms are inherently difficult to parallelize with MapReduce. In this paper, we discuss why adapting previous approaches to parallelize Single-Linkage clustering using MapReduce leads to very inefficient solutions when one wants to compute density-based clustering hierarchies. Preliminarily, we discuss one such solution, which is based on an exact, yet very computationally demanding, random blocks parallelization scheme. To be able to efficiently apply hierarchical density-based clustering to large datasets using MapReduce, we then propose a different parallelization scheme that computes an approximate clustering hierarchy based on a much faster, recursive sampling approach. This approach is based on HDBSCAN*, the state-of-the-art hierarchical density-based clustering algorithm, combined with a data summarization technique called data bubbles. The proposed method is evaluated in terms of both runtime and quality of the approximation on a number of datasets, showing its effectiveness and scalability.

show abstract

12 3 4 5

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Murilo Coelho Naldi

Efficiency issues of evolutionary k-means

Cluster ensemble selection based on relative validity indexes

Técnicas de combinação para agrupamento centralizado e distribuído de dados

Comparing convolutional neural networks and preprocessing techniques for HEp-2 cell classification in immunofluorescence images

Improving k-means through distributed scalable metaheuristics

Scalable Batch Stream Clustering with k Estimation

Comparison Among Methods for k Estimation in k-means

Hierarchical Density-Based Clustering Using MapReduce

Contact Info

Product

Resources

About