Enabling Highly Efficient k-Means Computations on the SW26010 Many-Core Processor of Sunway TaihuLight

Li, Min; Yang, Chao; Sun, Qiao; Ma, Wei; Cao, Wenlong; Ao, Yulong

doi:10.1007/s11390-019-1900-5

Cited by 10 publications

(5 citation statements)

References 41 publications


“…It was tested on an eight-core system and achieved a significant speedup over the naive parallel implementation of Lloyd's method. Li et al [41] and Li et al [42] proposed two implementations of Lloyd's algorithm for the SW26010 processor used in the Sunway TaihuLight supercomputer. (At the time of this writing, it was fourth on the list of the Top 500 supercomputers [43]).…”

Section: Related Research (mentioning)

confidence: 99%

Accelerated K-Means Algorithms for Low-Dimensional Data on Parallel Shared-Memory Systems

Kwedlo

Łubowicz

2021

IEEE Access


This paper considers the problem of exact accelerated algorithms for the K-means clustering of low-dimensional data on modern multi-core systems. A version of the filtering algorithm parallelized using the OpenMP (Open Multi-Processing) standard is proposed. The algorithm employs a kd-tree structure to skip some unnecessary calculations between cluster centroids and feature vectors. In our approach, both the kd-tree construction and the iterations of the K-means are parallelized using the OpenMP tasking mechanism. A new task is created for a recursive call performed during kd-tree construction and traversal. The tasks are executed in parallel by the cores of a shared-memory system. In computational experiments, we evaluated the parallel efficiency of our approach and compared its performance to the parallel Lloyd's method, a GPU (Graphics Processing Unit) formulation of the K-means algorithm, and two parallel triangle inequality-based algorithms intended for low-dimensional data. The evaluation was performed on six synthetic datasets from two distributions and seven real-life datasets. The experiments, executed on a 24-core system, indicated that our version of the filtering algorithm had satisfactory or high parallel efficiency. Its runtime was much shorter than those of competing algorithms. However, the advantage of the parallel filtering algorithm decreased rapidly as the dimension of data increased.

Index terms: acceleration of K-means, K-means clustering, kd-trees, OpenMP tasks, parallelization


“…The algorithm was tested on a system with two quadcore Intel CPUs giving a significant speedup over a naive implementation. [35] and [36] describe two implementations of Lloyd's algorithm for the SW26010 many-core processor used in Sunway TaihuLight supercomputer (at the time of writing this paper it was third on the Top500 supercomputer list [37]). While the previous work focuses on fine-tuned kernel running on a single processor, the latter discusses the implementation on thousands of nodes of TaihuLight.…”

Section: Related Work (mentioning)

confidence: 99%

A Hybrid MPI/OpenMP Parallelization of K-Means Algorithms Accelerated Using the Triangle Inequality

Kwedlo

Czochanski

2019

IEEE Access


The standard formulation of the K-means clustering (Lloyd's method) performs many unnecessary distance calculations. In this paper, we focus on four approaches that use the triangle inequality to avoid unnecessary distance calculations. These approaches are Drake's, Elkan's, Annulus, and Yinyang algorithms. We propose a hybrid MPI/OpenMP parallelization of these algorithms in which the dataset and the corresponding data structures storing bounds on distances are evenly divided among MPI processes. Then, in the assignment step of a K-means iteration, each MPI process computes the assignment of its portion of data using OpenMP threads. In the update step of the iteration, the cluster centroids are computed using a hierarchical all-reduce operation. In the computational experiments, we compared the strong scalability of these four algorithms with the scalability of Lloyd's algorithm, parallelized using the same approach. The results indicate that all four algorithms maintain an advantage in computing time over Lloyd's algorithm. A comparison with two software packages, whose sources are publicly available, in the same computing environment, shows that our implementations are more efficient.
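The structure described in this abstract (data split across workers, a local assignment step, then a global reduction for the update step) can be illustrated with a minimal, serial Python sketch. It is not the authors' implementation: the function names (`assign_with_pruning`, `partial_sums`, `all_reduce`, `kmeans_step`) are hypothetical, the list of `chunks` merely stands in for MPI processes, and `all_reduce` is a plain loop rather than a real MPI call. The pruning shown is only the basic triangle-inequality test (if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c)), not the full bound bookkeeping of Drake's, Elkan's, Annulus, or Yinyang algorithms.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid.

    A centroid c_j is skipped when d(c_best, c_j) >= 2 * d(p, c_best),
    because the triangle inequality then guarantees c_j is not closer.
    Returns (labels, number_of_skipped_distance_computations).
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per iteration.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centroids[0])
        for j in range(1, k):
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1  # distance d(p, c_j) is never computed
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def partial_sums(points, labels, k):
    """Per-worker coordinate sums and point counts for each cluster."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, label in zip(points, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += p[d]
    return sums, counts


def all_reduce(partials):
    """Element-wise sum over all workers (serial stand-in for an all-reduce)."""
    k = len(partials[0][1])
    dim = len(partials[0][0][0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for worker_sums, worker_counts in partials:
        for j in range(k):
            counts[j] += worker_counts[j]
            for d in range(dim):
                sums[j][d] += worker_sums[j][d]
    return sums, counts


def kmeans_step(chunks, centroids):
    """One K-means iteration over data that is pre-split into chunks."""
    k = len(centroids)
    partials = []
    for chunk in chunks:  # each chunk plays the role of one process
        labels, _ = assign_with_pruning(chunk, centroids)
        partials.append(partial_sums(chunk, labels, k))
    sums, counts = all_reduce(partials)
    # An empty cluster keeps its previous centroid.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
```

In this sketch only the per-cluster sums and counts cross the chunk boundary, which is why the update step reduces to a single collective sum regardless of how many points each worker holds.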


“…Li M et al implemented k-means on sw26010 manycore processor, and sustains a double-precision performance of over 348.1 Gflops. The result is 84% of the theoretical performance upper bound on a single core group [38]. Wang yichao et al employed OpenACC* to port GTC-P on the "Sunway Taihulight" supercomputer and achieve 2.5x speedup [39].…”

Section: Related Work (mentioning)

confidence: 99%

Parallel Implementation and Optimization of Regional Ocean Modeling System (ROMS) Based on Sunway SW26010 Many-Core Processor

et al. 2019


Ocean numerical models are steadily moving towards multi-physics processes and higher resolution, driven by the growth of measured ocean data and deeper research in the ocean field. General-purpose computing capability can therefore no longer meet these models' needs, and more powerful hardware and parallel software are required to run ocean numerical model programs. China has made great progress in the research and development of homegrown high-performance processors, and the Sunway SW26010 many-core processor is the most outstanding representative. This paper addresses the lag of ocean numerical model software behind homegrown processors, and presents, for the first time, a parallel implementation and optimization of the Regional Ocean Modeling System (ROMS) on the Sunway SW26010 many-core processor. Three programming methods are used: OpenACC*, athread with Fortran, and athread with C. These methods are compared in terms of programming approach, workload, and execution efficiency, which offers practical guidance for programmers who use Sunway SW26010 many-core processors. The evaluation measures the execution times and speedups of the model kernel and the full ROMS under different optimizations, input datasets, and numbers of computing processing elements (CPEs). The results show that, compared with the original ROMS, the speedup of the optimized hotspot program can reach 3.69x.
