2019
DOI: 10.1142/s0218126619500658
DHC: A Distributed Hierarchical Clustering Algorithm for Large Datasets

Abstract: Hierarchical clustering is a classical method that provides a hierarchical representation for data analysis. In practical applications, however, it is difficult to apply to massive datasets due to its high computational complexity. To overcome this challenge, this paper presents a novel distributed storage and computation hierarchical clustering algorithm, which has a lower time complexity than the standard hierarchical clustering algorithms. Our proposed approach is suitable for hierarchical clus…

Cited by 11 publications (11 citation statements)
References 38 publications (55 reference statements)
“…Du and Lin [25] give a parallel HAC algorithm on a cluster of compute nodes. Zhang et al. [70] propose a distributed algorithm for HAC that partitions the dataset using k-d trees or quadtrees and then, for each leaf node, finds a region where the R-NNs might exist. In parallel, each compute node finds the local R-NNs in its region, and the global R-NNs are then found from the local pairs.…”
Section: Related Work (mentioning; confidence: 99%)
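The partition-then-merge idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: a single median split stands in for the k-d-tree/quadtree partition, a brute-force closest-pair search stands in for the local R-NN search, and all function names are ours.

```python
import numpy as np

def partition(points, depth):
    """Recursively split by the median along alternating axes -- a crude
    stand-in for the k-d-tree / quadtree partition used by the paper."""
    if depth == 0 or len(points) < 2:
        return [points]
    axis = depth % points.shape[1]
    med = np.median(points[:, axis])
    left = points[points[:, axis] <= med]
    right = points[points[:, axis] > med]
    if len(left) == 0 or len(right) == 0:  # degenerate split, stop here
        return [points]
    return partition(left, depth - 1) + partition(right, depth - 1)

def local_nn_pair(points):
    """Closest pair inside one region, by brute force (one compute node's job)."""
    best = (np.inf, None, None)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = np.linalg.norm(points[i] - points[j])
            if d < best[0]:
                best = (d, tuple(points[i]), tuple(points[j]))
    return best

def global_nn_pair(points, depth=1):
    """Merge step: the global nearest pair is selected from the local pairs.
    NOTE: this sketch ignores pairs that straddle a partition boundary;
    the actual algorithm searches an extra boundary region to catch them."""
    local = [local_nn_pair(p) for p in partition(points, depth) if len(p) > 1]
    return min(local, key=lambda t: t[0])
```

The sketch runs the per-region searches serially; in the distributed setting each region would live on its own node and only the small local results travel to the merge step.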
“…Unfortunately, exact HAC algorithms usually require Ω(n²) work, since the distances between all pairs of points must be computed. Due to this significant computational cost, several parallel exact HAC algorithms have been proposed in the literature [25, 33, 35, 43, 44, 59, 67, 70], but most of them maintain a distance matrix, which requires quadratic memory and makes them unscalable to large datasets. The only parallel exact algorithm that works for the metrics we consider and uses subquadratic space is by Zhang et al. [70], but it has not been shown to scale to large datasets.…”
Section: Introduction (mentioning; confidence: 99%)
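To make the Ω(n²) bottleneck concrete, here is a minimal illustrative sketch (ours, not from any of the cited papers) of naive single-linkage HAC that keeps the full n×n distance matrix, exhibiting exactly the quadratic memory and work that the quoted passage criticizes.

```python
import numpy as np

def naive_single_linkage(points):
    """Naive agglomerative clustering over an explicit n x n distance
    matrix: Theta(n^2) memory and at least Theta(n^2) work."""
    n = len(points)
    # All pairwise distances up front -- the quadratic-memory culprit.
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    active = set(range(n))
    merges = []
    while len(active) > 1:
        idx = sorted(active)
        sub = D[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = idx[i], idx[j]
        merges.append((a, b, D[a, b]))
        # Single linkage: new cluster distance is the min over members.
        D[a, :] = np.minimum(D[a, :], D[b, :])
        D[:, a] = D[a, :]
        D[a, a] = np.inf
        active.remove(b)
    return merges
```

Each merge scans the surviving submatrix, so even before memory becomes the limit, the work alone rules this pattern out for massive datasets, which is why the subquadratic-space approach of Zhang et al. [70] matters.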
“…D-kNN builds on kNNB for distributed storage and computation over large datasets. Its distributed storage and computation framework extends the one developed for hierarchical clustering [33]. Based on kNNB, D-kNN first divides a large dataset into small subsets, then searches for k-nearest neighbors within each subset, and finally merges the results to obtain the global k-nearest neighbors.…”
Section: Distributed Storage and Computation Framework (mentioning; confidence: 99%)
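The divide/local-search/merge pattern of the quoted D-kNN description can be sketched as follows. This is an assumption-laden illustration, not the D-kNN algorithm itself: the subsets are taken as given (the kNNB division step is elided), and all names are ours.

```python
import numpy as np

def local_knn(query, subset, k):
    """k nearest neighbors of `query` within one subset (one node's data)."""
    d = np.linalg.norm(subset - query, axis=1)
    order = np.argsort(d)[:k]
    return [(d[i], tuple(subset[i])) for i in order]

def global_knn(query, subsets, k):
    """Merge each subset's local k-NN list into the global k-NN.
    Since every subset contributes its own top-k candidates, the true
    global top-k is guaranteed to be among the merged candidates."""
    candidates = []
    for s in subsets:
        candidates.extend(local_knn(query, s, k))
    candidates.sort(key=lambda t: t[0])
    return candidates[:k]
```

The correctness of the merge rests on a simple observation: any point in the global top-k is, a fortiori, in the top-k of its own subset, so only k candidates per subset ever need to be shipped to the merging node.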