2015 IEEE First International Conference on Big Data Computing Service and Applications 2015
DOI: 10.1109/bigdataservice.2015.67

A Scalable Hierarchical Clustering Algorithm Using Spark

Abstract: Clustering is often an essential first step in data mining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting the potential group structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as…

Cited by 38 publications (10 citation statements)
References 17 publications
“…The machine learning algorithm library based on Spark named MLlib [35] also contains a hierarchical clustering algorithm; it is a parallel implementation of bisection k-means algorithm [48], which is developed based on paper [49]. Jin et al proposed SHAS [50] that parallelizes the classical SHC algorithm using Spark. The algorithm includes three stages: data point division, local clustering and merging.…”
Section: Related Work (mentioning)
confidence: 99%
“…Jin et al proposed a parallel SHC algorithm based on Spark named SHAS [62]. The framework of SHAS is the same as Figure 3, which mainly includes three stages: data point division, local clustering, and cluster merging.…”
Section: Parallel Hierarchical Clustering Algorithm (mentioning)
confidence: 99%
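
The three stages named in the statement above (division, local clustering, merging) can be illustrated with a small single-machine sketch. This is not the SHAS implementation and does not use Spark; the function names `kruskal` and `three_stage_mst`, the round-robin split, and the Euclidean distance are all assumptions made for illustration. It relies on a standard property: if every edge of the full graph belongs to at least one subproblem, running Kruskal on the union of the local spanning forests yields a minimum spanning tree of the full graph.

```python
import math
from itertools import combinations


def kruskal(n, edges):
    """Return minimum spanning forest edges (weight, u, v) on vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so keep it
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def three_stage_mst(points, n_splits=2):
    """Hypothetical divide / local / merge sketch (not the authors' code)."""
    n = len(points)

    def dist(u, v):
        return math.dist(points[u], points[v])

    # Stage 1: data point division (round-robin split, an assumption).
    splits = [list(range(i, n, n_splits)) for i in range(n_splits)]

    # Stage 2: local clustering. Solve one subproblem per split and one per
    # pair of splits, so that every edge of the full graph is covered once.
    local = []
    for s in splits:
        local += kruskal(n, [(dist(u, v), u, v) for u, v in combinations(s, 2)])
    for a, b in combinations(splits, 2):
        local += kruskal(n, [(dist(u, v), u, v) for u in a for v in b])

    # Stage 3: merging. Kruskal over the much smaller set of local edges.
    return kruskal(n, local)
```

In a distributed setting each subproblem in stage 2 would be an independent task; here they simply run in a loop.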
“…Regarding the superiorities of Spark, recently some clustering approaches have been proposed based on Spark. The authors of a past paper [26] presented a scalable hierarchical clustering algorithm using Spark. By formulating Single-Linkage hierarchical clustering as a Minimum Spanning Tree (MST) problem, it was shown that Spark is totally successful in finding clusters through natural iterative process with nice scalability and high performance.…”
Section: Preliminaries Literature Review and Related Work (mentioning)
confidence: 99%
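
The link between single-linkage clustering and the Minimum Spanning Tree mentioned in the statement above can be shown with a minimal sketch: processing edges in increasing order of length (Kruskal's algorithm), each accepted edge corresponds to one single-linkage merge, and stopping after n - k merges leaves k clusters. The function name `single_linkage_labels`, the Euclidean distance, and the brute-force edge list are assumptions for illustration, not the paper's Spark implementation.

```python
import math
from itertools import combinations


def single_linkage_labels(points, k):
    """Cut a Kruskal-built MST into k single-linkage clusters (illustrative)."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise edges, sorted by length. This is O(n^2) and only
    # suitable for small inputs.
    edges = sorted(
        (math.dist(points[u], points[v]), u, v)
        for u, v in combinations(range(n), 2)
    )

    merges = []  # accepted MST edges, in dendrogram merge order
    for w, u, v in edges:
        if len(merges) == n - k:  # k clusters remain, so stop merging
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            merges.append((w, u, v))

    # Relabel the union-find roots as consecutive cluster ids.
    ids = {}
    labels = [ids.setdefault(find(i), len(ids)) for i in range(n)]
    return labels, merges
```

For example, `single_linkage_labels([(0, 0), (1, 0), (5, 0), (6, 0), (20, 0)], k=3)` groups the first two points, the next two points, and leaves the last point on its own.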