SC18: International Conference for High Performance Computing, Networking, Storage and Analysis 2018
DOI: 10.1109/sc.2018.00016
Large-Scale Hierarchical k-means for Heterogeneous Many-Core Supercomputers

Abstract: This paper presents a novel design and implementation of the k-means clustering algorithm targeting the Sunway TaihuLight supercomputer. We introduce a multi-level parallel partition approach that partitions not only by dataflow and centroid, but also by dimension. Our multi-level (nkd) approach unlocks the potential of the hierarchical parallelism in the SW26010 heterogeneous many-core processor and the system architecture of the supercomputer. Our design is able to process large-scale clustering problems with up …
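The abstract's multi-level (nkd) idea is that the dominant cost of a Lloyd's iteration, the point-to-centroid distance computation, can be blocked along all three axes: data points (n), centroids (k), and dimensions (d). A minimal sketch of that blocking is below; the function names and block sizes are illustrative assumptions, not the paper's actual implementation, and the loops stand in for the parallel partitions that the paper maps onto the SW26010 hierarchy.

```python
import numpy as np

def blocked_assignment(X, C, bn=2, bk=2, bd=2):
    """Assign each point in X (n x d) to its nearest centroid in C (k x d),
    accumulating squared distances block by block along n, k, and d."""
    n, d = X.shape
    k = C.shape[0]
    dist = np.zeros((n, k))
    for i0 in range(0, n, bn):          # partition by data points (n)
        for j0 in range(0, k, bk):      # partition by centroids (k)
            for l0 in range(0, d, bd):  # partition by dimensions (d)
                xb = X[i0:i0 + bn, l0:l0 + bd]
                cb = C[j0:j0 + bk, l0:l0 + bd]
                # partial squared distances over this dimension block
                diff = xb[:, None, :] - cb[None, :, :]
                dist[i0:i0 + bn, j0:j0 + bk] += (diff ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def update_centroids(X, labels, k):
    """Recompute each centroid as the mean of its assigned points
    (assumes no cluster ends up empty)."""
    return np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
```

Because the squared distance is a sum over dimensions, each (n, k, d) block contributes an independent partial sum, which is what makes the three-way partition legal; the blocked result is identical to the unblocked computation.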

Cited by 11 publications (7 citation statements)
References 29 publications
“…Our K-Means implementation partitions the problem based on data size, since its application to climate data is a weak scaling problem. We process much larger data sizes, though [51] showcases good performance for much higher dimensionality (up to O(10E6)) and cluster counts (O(10E6)) than our use case. For comparison, in our capstone problem, the K-Means stage of the DisCo workflow processes ∼70E9 lightcones (∼70E6/node) of 84 dimensions into 8 clusters in 2.32 s/iteration on Intel E5-2698 v3 (vs. 2.5E6 samples of 68 dimensions into 10,000 clusters in 2.42 s/iteration on 16 nodes of Intel i7-3770K processors in [51]).…”
Section: Related Work
confidence: 97%
“…[49] is an extension of this work for larger datasets of billions of points, and [50] optimizes K-Means performance on Intel KNC processors by efficient vectorization. The authors of [51] propose a hierarchical scheme for partitioning data based on data flow, centroids (clusters), and dimensions. Our K-Means implementation partitions the problem based on data size, since its application to climate data is a weak scaling problem.…”
Section: Related Work
confidence: 99%
“…It was tested on an eight-core system and achieved a significant speedup over the naive parallel implementation of Lloyd's method. Li et al [41] and Li et al [42] proposed two implementations of Lloyd's algorithm for the SW26010 processor used in the Sunway TaihuLight supercomputer. (At the time of this writing, it was fourth on the list of the Top 500 supercomputers [43]).…”
Section: Related Research
confidence: 99%
“…The algorithm was tested on a system with two quad-core Intel CPUs, giving a significant speedup over a naive implementation. [35] and [36] describe two implementations of Lloyd's algorithm for the SW26010 many-core processor used in the Sunway TaihuLight supercomputer (at the time of writing this paper it was third on the Top500 supercomputer list [37]). While the former focuses on a fine-tuned kernel running on a single processor, the latter discusses the implementation on thousands of nodes of TaihuLight.…”
Section: Related Work
confidence: 99%