A Scalable Hierarchical Clustering Algorithm Using Spark

Chen, Jin; Liu, Ruoqian; Chen, Zhengzhang; Hendrix, William; Agrawal, Ankit; Choudhary, Alok

doi:10.1109/bigdataservice.2015.67

Cited by 38 publications

(10 citation statements)

References 17 publications


“…The machine learning algorithm library based on Spark named MLlib [35] also contains a hierarchical clustering algorithm; it is a parallel implementation of bisection k-means algorithm [48], which is developed based on paper [49]. Jin et al proposed SHAS [50] that parallelizes the classical SHC algorithm using Spark. The algorithm includes three stages: data point division, local clustering and merging.…”

Section: Related Work (mentioning)

confidence: 99%

PSubCLUS: A Parallel Subspace Clustering Algorithm Based On Spark

Xiao

2021

IEEE Access



“…Jin et al proposed a parallel SHC algorithm based on Spark named SHAS [62]. The framework of SHAS is the same as Figure 3, which mainly includes three stages: data point division, local clustering, and cluster merging.…”

Section: Parallel Hierarchical Clustering Algorithm (mentioning)

confidence: 99%

A Survey of Parallel Clustering Algorithms Based on Spark

Xiao

2020

Scientific Programming


Clustering is one of the most important unsupervised machine learning tasks, which is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, the classical clustering algorithms cannot meet the requirements of clustering for big data. Spark is one of the most popular parallel processing platforms for big data, and many researchers have proposed many parallel clustering algorithms based on Spark. In this paper, the existing parallel clustering algorithms based on Spark are classified and summarized, the parallel design framework of each kind of algorithms is discussed, and after comparing different kinds of algorithms, the direction of the future research is discussed.
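The three stages named in the citation statements above (data point division, local clustering, cluster merging) can be illustrated with a small single-machine sketch. This is not the SHAS implementation: the round-robin split, the choice of one subproblem per pair of splits, and the function names `kruskal`, `local_mst`, and `partitioned_mst` are assumptions made for illustration. In a Spark setting, each local subproblem would plausibly run as an independent task.

```python
from itertools import combinations
from math import dist


def kruskal(nodes, edges):
    """Return minimum spanning tree/forest edges from (weight, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components, so keep it
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def local_mst(points, ids):
    """MST of the complete Euclidean graph restricted to the given point ids."""
    edges = [(dist(points[a], points[b]), a, b) for a, b in combinations(ids, 2)]
    return kruskal(ids, edges)


def partitioned_mst(points, num_splits):
    ids = list(range(len(points)))
    # Stage 1 (division): assumed round-robin split of the point ids.
    splits = [ids[i::num_splits] for i in range(num_splits)]
    # Stage 2 (local clustering): one independent subproblem per pair of
    # splits, so every edge of the full graph is seen by some subproblem.
    local_edges = set()
    for i in range(num_splits):
        for j in range(i, num_splits):
            sub = splits[i] if i == j else splits[i] + splits[j]
            local_edges.update(local_mst(points, sub))
    # Stage 3 (merging): an edge rejected locally is the heaviest edge on
    # some cycle, so the surviving edges still contain a global MST.
    return kruskal(ids, local_edges)
```

The merge step works on far fewer edges than the full pairwise graph, which is the reason a divide-and-merge layout of this kind suits a parallel platform.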


“…Regarding the superiorities of Spark, recently some clustering approaches have been proposed based on Spark. The authors of a past paper [26] presented a scalable hierarchical clustering algorithm using Spark. By formulating Single-Linkage hierarchical clustering as a Minimum Spanning Tree (MST) problem, it was shown that Spark is totally successful in finding clusters through natural iterative process with nice scalability and high performance.…”

Section: Preliminaries Literature Review and Related Work (mentioning)

confidence: 99%

A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark

Hosseini

Kiani

2018

Symmetry


Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted a lot of research interest. The present paper proposes a distributed big data clustering approach-based on adaptive density estimation. The proposed method is developed-based on Apache Spark framework and tested on some of the prevalent datasets. In the first step of this algorithm, the input data is divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations. Each of the proposed algorithm steps is completely independent of the others and no serial bottleneck exists all over the clustering procedure. Locality preservation also filters out the outliers and enhances the robustness of the proposed approach. Density is defined on the basis of Ordered Weighted Averaging (OWA) distance which makes clusters more homogenous. According to the density of each node, the local density peaks will be detected adaptively. By merging the local peaks, final cluster centers will be obtained and other data points will be a member of the cluster with the nearest center. The proposed method has been implemented and compared with similar recently published researches. Cluster validity indexes achieved from the proposed method shows its superiorities in precision and noise robustness in comparison with recent researches. Comparison with similar approaches also shows superiorities of the proposed method in scalability, high performance, and low computation cost. The proposed method is a general clustering approach and it has been used in gene expression clustering as a sample of its application.
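The citation statement above describes single-linkage hierarchical clustering formulated as a Minimum Spanning Tree (MST) problem. A minimal sketch of that equivalence, assuming Euclidean distance and a small in-memory dataset: build an MST over the points, then drop its k-1 heaviest edges, and the remaining connected components are the k single-linkage clusters. The function names `prim_mst` and `single_linkage_labels` are hypothetical, and Prim's algorithm is used here only because it is compact, not because the cited paper uses it.

```python
from math import dist


def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph; returns (w, u, v) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection cost to the tree
    link = [-1] * n            # tree vertex that gives that cost
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if link[u] >= 0:
            edges.append((best[u], link[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], link[v] = d, u
    return edges


def single_linkage_labels(points, k):
    """Keep the n-k lightest MST edges; components are the k clusters."""
    n = len(points)
    kept = sorted(prim_mst(points))[: max(n - k, 0)]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)
    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Processing the MST edges in increasing weight order reproduces the merge order of the single-linkage dendrogram, so one MST serves every choice of k.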

