A Survey of Parallel Clustering Algorithms Based on Spark

Xiao, Wen; Hu, Juan

doi:10.1155/2020/8884926

Cited by 10 publications

(2 citation statements)

References 69 publications


“…Clustering is divided into a plethora of types, each of which necessitates an iterative procedure, making it unsuitable for large-scale data processing. As a result, the single-trafficscale evolving clustering method (ECM) had to be transformed into a parallel clustering methodology (PECM) capable of handling large amounts of data [17]. PECM (parallel evolving clustering method) is a statistics evaluation technique that runs in the Apache Spark framework and leverages HDFS (Hadoop Distributed File System) for statistics storage [18].…”

Section: Related Work (mentioning)

confidence: 99%

Parallel Implementation of Statistical DBSCAN Algorithm for Spark-based Clustering on Google Cloud Platform

2023

IJIES


We present a new parallel density-based spatial clustering of applications with noise (DBSCAN) algorithm for Spark on the Google Cloud Platform (GCP). Statistical analysis is applied to determine DBSCAN's optimal parameters and so enhance clustering performance. For scalability, cost-based R-tree partitioning is selected, which splits the dataset into balanced workloads according to its distribution. Parallel DBSCAN consists of three parts: local DBSCAN, partitioning, and merging. Optimizing the partitioning of parallel DBSCAN is important for saving time and space compared with serial DBSCAN. This approach can improve the performance and time cost on large datasets. The modified statistical cost-based DBSCAN (SCbs-DBSCAN) is applied to the UCI (University of California, Irvine) standard datasets, basic clustering benchmarks, and large datasets of different scales. The experimental results show that the proposed algorithm is 10~15% more efficient in clustering performance and runs about 1.5x~3x faster than an alternative parallel DBSCAN method on Spark, without sacrificing clustering quality.
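The three parts named in this abstract (partitioning, local DBSCAN, merging) can be illustrated with a minimal single-process sketch. It is not the SCbs-DBSCAN method: it assumes 2-D points, swaps the cost-based R-tree partitioning for fixed-width strips along x with an eps-wide overlap, omits the statistical parameter selection, and uses no Spark at all. The names `region`, `local_dbscan`, `parallel_dbscan`, and the `width` parameter are hypothetical, and `min_pts` is assumed to count the point itself.

```python
from collections import defaultdict

def region(points, idx, i, eps):
    """Indices from idx lying within eps of point i (i itself included)."""
    return [j for j in idx
            if (points[i][0] - points[j][0]) ** 2
            + (points[i][1] - points[j][1]) ** 2 <= eps * eps]

def local_dbscan(points, idx, eps, min_pts):
    """Plain DBSCAN restricted to idx. Returns ({index: label}, core set); -1 is noise."""
    labels, core, cid = {}, set(), 0
    for i in idx:
        if i in labels:
            continue
        nbrs = region(points, idx, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = -1  # may be relabelled later as a border point
            continue
        core.add(i)
        labels[i] = cid
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels.get(j, -1) != -1:
                continue  # already assigned to a cluster
            labels[j] = cid
            j_nbrs = region(points, idx, j, eps)
            if len(j_nbrs) >= min_pts:
                core.add(j)
                queue.extend(j_nbrs)
        cid += 1
    return labels, core

def parallel_dbscan(points, eps, min_pts, width):
    # 1) Partitioning: strips of the given width, each extended by eps on both sides.
    x0 = min(p[0] for p in points)
    n_parts = int((max(p[0] for p in points) - x0) // width) + 1
    parts = [[] for _ in range(n_parts)]
    for i, p in enumerate(points):
        for k in range(n_parts):
            if x0 + k * width - eps <= p[0] <= x0 + (k + 1) * width + eps:
                parts[k].append(i)
    # 2) Local DBSCAN per partition (in Spark this would be a map over partitions).
    results = [local_dbscan(points, idx, eps, min_pts) for idx in parts]
    # 3) Merging: union local clusters that share a core point (union-find).
    parent = {}
    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    seen, is_core = defaultdict(list), set()
    for k, (labels, core) in enumerate(results):
        is_core |= core
        for i, c in labels.items():
            if c != -1:
                seen[i].append((k, c))
    for i, ids in seen.items():
        if i in is_core:
            for other in ids[1:]:
                parent[find(other)] = find(ids[0])
    # Relabel merged clusters 0, 1, 2, ... in order of first appearance.
    roots, final = {}, []
    for i in range(len(points)):
        if i in seen:
            final.append(roots.setdefault(find(seen[i][0]), len(roots)))
        else:
            final.append(-1)
    return final

pts = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (20, 0), (21, 0), (50, 0)]
print(parallel_dbscan(pts, eps=1.1, min_pts=2, width=3))
```

The overlap is what makes the merge step possible: every point owned by a strip sees its full eps-neighbourhood locally, and a chain that crosses a strip boundary is stitched together through the core points replicated in both strips. Border points that are reachable from two clusters may end up in either one, as in serial DBSCAN.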


“…Such a notably efficient KMeans-based is demonstrated in [21], whereas in [22] a highly efficient parallelization of the hierarchical agglomerative clustering method in Spark is also presented. A more detailed review on efficient parallel clustering algorithms for big data in Spark framework can be found in [29].…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Big Text Data Clustering Algorithms using Hadoop and Spark

Gerakidis, Megarchioti, Mamalis

2021

IJCA


Document clustering is a traditional, efficient, and quite effective text mining technique for gaining better insight into which documents of a collection could be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the best known and most commonly used clustering algorithms; the former due to its low time cost and the latter due to its accuracy. However, even the use of K-Means in text clustering over large-scale collections can lead to unacceptable time costs. In this paper we first address some of the most valuable approaches for document clustering over such 'big data' (large-scale) collections. We then present two very promising alternatives: (a) a variation of an existing K-Means-based fast clustering technique (known as BigKClustering, BKC) so that it can be applied in document clustering, and (b) a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure on a sample of the input dataset and then uses the results as the initial centers for a K-Means-based assignment of the rest of the documents, with very few iterations. We also give highly efficient adaptations of the proposed techniques in the MapReduce model, which are then experimentally tested using Apache Hadoop and Spark over a real cluster environment. As the experiments show, both lead to acceptable clustering quality as well as to significant time improvements compared to K-Means (especially the Buckshot-based algorithm), thus constituting very promising alternatives for big document collections.
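The hybrid approach in (b) can be sketched as follows. This is a minimal single-machine illustration, not the paper's customized MapReduce version: it assumes dense numeric tuples in place of document vectors, a naive centroid-linkage HAC, squared Euclidean distance, and the sample size of roughly sqrt(k·n) used by the classic Buckshot algorithm. The function names and the `iterations` and `seed` parameters are hypothetical.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hac(points, k):
    """Naive agglomerative clustering (centroid linkage) down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return clusters

def nearest(p, centers):
    return min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))

def buckshot(points, k, iterations=2, seed=0):
    rng = random.Random(seed)
    # Phase 1: HAC on a small random sample (assumed size about sqrt(k * n)).
    size = min(len(points), max(k, math.ceil(math.sqrt(k * len(points)))))
    centers = [centroid(c) for c in hac(rng.sample(points, size), k)]
    # Phase 2: a few K-Means passes over the full dataset, seeded by phase 1.
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # An empty group keeps its previous center.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = buckshot(data, k=2)
print(sorted(centers), labels)
```

The quadratic HAC step only ever sees the sample, so its cost stays small relative to the collection, while the seeded K-Means passes touch every point but need only a handful of iterations.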


DPISCAN: Distributed and parallel architecture with indexing for structural clustering of massive dynamic graphs

Kumar, D’Mello

2022

Int J Data Sci Anal

