Clustering extracts hidden patterns and groups of similar objects from data. As an unsupervised learning method, it is a crucial technique for big data analysis because of the massive number of unlabeled objects involved. Density-based algorithms have attracted research interest because they help to reveal complex patterns in spatial datasets, which contain information about co-located objects. Big data clustering is challenging because the volume of data increases exponentially; clustering with MapReduce can help to meet this challenge. Density-based algorithms in MapReduce have therefore been investigated extensively over the past decade to address the problem of big data clustering. Despite the diversity of the proposed algorithms, the field lacks a structured review of the available algorithms and of the techniques they use for partitioning, local clustering, and merging. This study formalizes the problem of density-based clustering using MapReduce, proposes a taxonomy to categorize the proposed algorithms, and provides a systematic and comprehensive comparison of these algorithms according to their partitioning technique, type of local clustering, merging technique, and the exactness of their implementations. Finally, the study highlights outstanding challenges and opportunities to contribute to the field of density-based clustering using MapReduce.
One of the main requirements in clustering spatial datasets is the discovery of arbitrarily shaped clusters. Density-based algorithms satisfy this requirement by forming clusters as dense regions of the space that are separated by sparser regions. DENCLUE is a density-based algorithm that generates a compact mathematical description of arbitrarily shaped clusters. Although DENCLUE has proved effective, it cannot handle large datasets because of its high computational complexity. Several attempts have been made to improve the performance of the DENCLUE algorithm, including DENCLUE 2. In this study, an empirical evaluation is conducted to highlight the differences between the first DENCLUE variant, which uses the hill-climbing search method, and the DENCLUE 2 variant, which uses the fast hill-climbing method. The study aims to provide a basis for further enhancements of both algorithms. The evaluation results indicate that DENCLUE 2 is faster than DENCLUE 1. However, the first DENCLUE variant outperforms the second in discovering arbitrarily shaped clusters.
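The difference between the two search methods can be illustrated with a minimal sketch. It is not the implementation evaluated in the study: it assumes a Gaussian kernel, and the function names, the bandwidth `h`, the step length `delta`, and the tolerance `tol` are hypothetical choices made for illustration. The first routine takes steps of fixed length along the density gradient, as in the original hill-climbing; the second moves to the kernel-weighted mean of the data, so the step length adapts automatically, as in the fast hill-climbing of DENCLUE 2.

```python
import numpy as np


def gaussian_weights(x, data, h):
    """Gaussian kernel weight of every data point relative to x (bandwidth h)."""
    sq_dist = np.sum((data - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * h ** 2))


def density(x, data, h):
    """Unnormalized kernel density estimate at x."""
    return gaussian_weights(x, data, h).sum()


def hill_climb_fixed_step(x, data, h, delta=0.05, max_iter=500):
    """Fixed-step hill-climbing: move a distance delta along the gradient
    direction until the density estimate stops increasing."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        w = gaussian_weights(x, data, h)
        # Gradient direction of the Gaussian density estimate (scale omitted).
        grad = (w[:, None] * (data - x)).sum(axis=0)
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        candidate = x + delta * grad / norm
        if density(candidate, data, h) <= density(x, data, h):
            break  # the last step did not improve the density
        x = candidate
    return x


def hill_climb_fast(x, data, h, tol=1e-6, max_iter=500):
    """Fast hill-climbing: jump to the kernel-weighted mean of the data,
    which needs no step-size parameter."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        w = gaussian_weights(x, data, h)
        new_x = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(new_x - x) < tol:
            return new_x
        x = new_x
    return x
```

In both cases each data point is climbed to a density attractor, and points whose attractors coincide (up to a tolerance) are assigned to the same cluster. The fixed-step variant stops within roughly `delta` of the attractor, whereas the weighted-mean update keeps refining the position until it moves by less than `tol`.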