2019
DOI: 10.1142/s0218126619500658
DHC: A Distributed Hierarchical Clustering Algorithm for Large Datasets

Abstract: Hierarchical clustering is a classical method that provides a hierarchical representation for data analysis. In practical applications, however, it is difficult to apply to massive datasets due to its high computational complexity. To overcome this challenge, this paper presents a novel distributed storage and computation hierarchical clustering algorithm, which has a lower time complexity than the standard hierarchical clustering algorithms. Our proposed approach is suitable for hierarchical clus…

Cited by 11 publications (11 citation statements)
References 38 publications (55 reference statements)
“…Du and Lin [25] give a parallel HAC algorithm on a cluster of compute nodes. Zhang et al. [70] propose a distributed algorithm for HAC that partitions the dataset using k-d trees or quadtrees and then, for each leaf node, finds a region where the R-NNs might exist. In parallel, each compute node finds the local R-NNs in its region, and the global R-NNs are then found from the local pairs.…”
Section: Related Work (mentioning; confidence: 99%)
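The partition-then-merge idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: a single median split stands in for the k-d-tree/quadtree partition, a brute-force closest-pair search stands in for the local R-NN search, and all function names are ours.

```python
import numpy as np

def partition(points, depth):
    """Recursively split by the median along alternating axes -- a crude
    stand-in for the k-d-tree / quadtree partition used by the paper."""
    if depth == 0 or len(points) < 2:
        return [points]
    axis = depth % points.shape[1]
    med = np.median(points[:, axis])
    left = points[points[:, axis] <= med]
    right = points[points[:, axis] > med]
    if len(left) == 0 or len(right) == 0:  # degenerate split, stop here
        return [points]
    return partition(left, depth - 1) + partition(right, depth - 1)

def local_nn_pair(points):
    """Closest pair inside one region, by brute force (one compute node's job)."""
    best = (np.inf, None, None)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = np.linalg.norm(points[i] - points[j])
            if d < best[0]:
                best = (d, tuple(points[i]), tuple(points[j]))
    return best

def global_nn_pair(points, depth=1):
    """Merge step: the global nearest pair is selected from the local pairs.
    NOTE: this sketch ignores pairs that straddle a partition boundary;
    the actual algorithm searches an extra boundary region to catch them."""
    local = [local_nn_pair(p) for p in partition(points, depth) if len(p) > 1]
    return min(local, key=lambda t: t[0])
```

The sketch runs the per-region searches serially; in the distributed setting each region would live on its own node and only the small local results travel to the merge step.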
“…Unfortunately, exact HAC algorithms usually require Ω(n²) work, since the distances between all pairs of points must be computed. Due to this significant computational cost, several parallel exact HAC algorithms have been proposed in the literature [25, 33, 35, 43, 44, 59, 67, 70], but most of them maintain a distance matrix, which requires quadratic memory and makes them unscalable to large datasets. The only parallel exact algorithm that works for the metrics we consider and uses subquadratic space is by Zhang et al. [70], but it has not been shown to scale to large datasets.…”
Section: Introduction (mentioning; confidence: 99%)
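To make the Ω(n²) bottleneck concrete, here is a minimal illustrative sketch (ours, not from any of the cited papers) of naive single-linkage HAC that keeps the full n×n distance matrix, exhibiting exactly the quadratic memory and work that the quoted passage criticizes.

```python
import numpy as np

def naive_single_linkage(points):
    """Naive agglomerative clustering over an explicit n x n distance
    matrix: Theta(n^2) memory and at least Theta(n^2) work."""
    n = len(points)
    # All pairwise distances up front -- the quadratic-memory culprit.
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    active = set(range(n))
    merges = []
    while len(active) > 1:
        idx = sorted(active)
        sub = D[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = idx[i], idx[j]
        merges.append((a, b, D[a, b]))
        # Single linkage: new cluster distance is the min over members.
        D[a, :] = np.minimum(D[a, :], D[b, :])
        D[:, a] = D[a, :]
        D[a, a] = np.inf
        active.remove(b)
    return merges
```

Each merge scans the surviving submatrix, so even before memory becomes the limit, the work alone rules this pattern out for massive datasets, which is why the subquadratic-space approach of Zhang et al. [70] matters.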
“…D-kNN builds on kNNB for distributed storage and computation over large datasets. Its distributed storage and computation framework extends the one developed for hierarchical clustering [33]. Based on kNNB, D-kNN first divides a large dataset into small subsets, then searches for k-nearest neighbors within each subset, and finally merges the results to obtain the global k-nearest neighbors.…”
Section: Distributed Storage and Computation Framework (mentioning; confidence: 99%)
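The divide/local-search/merge pattern of the quoted D-kNN description can be sketched as follows. This is an assumption-laden illustration, not the D-kNN algorithm itself: the subsets are taken as given (the kNNB division step is elided), and all names are ours.

```python
import numpy as np

def local_knn(query, subset, k):
    """k nearest neighbors of `query` within one subset (one node's data)."""
    d = np.linalg.norm(subset - query, axis=1)
    order = np.argsort(d)[:k]
    return [(d[i], tuple(subset[i])) for i in order]

def global_knn(query, subsets, k):
    """Merge each subset's local k-NN list into the global k-NN.
    Since every subset contributes its own top-k candidates, the true
    global top-k is guaranteed to be among the merged candidates."""
    candidates = []
    for s in subsets:
        candidates.extend(local_knn(query, s, k))
    candidates.sort(key=lambda t: t[0])
    return candidates[:k]
```

The correctness of the merge rests on a simple observation: any point in the global top-k is, a fortiori, in the top-k of its own subset, so only k candidates per subset ever need to be shipped to the merging node.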