Enabling and scaling the HPCG benchmark on the newest generation Sunway supercomputer with 42 million heterogeneous cores

Zhu, Qianchao; Luo, Hao; Yang, Chao; Ding, Mingshuo; Yin, Wen; Yuan, Xinhui

doi:10.1145/3458817.3476158

Cited by 23 publications

(11 citation statements)

References 39 publications


“…Note that we also observe a similar execution time distribution on other hardware platforms used in this paper. Moreover, independent studies have shown that SYMGS and SPMV operations are memory-bounded in HPCG, each has a low computation-to-memory ratio of 0.152 and 0.156 flops/byte, respectively [63]. Such a low arithmetic intensity further highlights the need of memory-aware optimization for MG.…”

Section: Overhead of SYMGS (mentioning)

confidence: 99%

“…However, little work has attempted to optimize the memory access latency of SYMGS on multi-core CPUs. SYMGS is known to be memory-bounded because the algorithm needs to access large, sparse matrices that cannot fit into the last level cache and the kernel computation has a low arithmetic intensity [63]. Reducing the memory access latency is essential for gaining further performance improvement for SYMGS.…”

Section: Introduction (mentioning)

confidence: 99%
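The low arithmetic intensity cited in these statements can be made concrete with a rough estimate. The sketch below is a back-of-the-envelope model, not the measurement method behind [63]: it assumes CSR storage, 8-byte values, 4-byte indices, and that each entry of the input and output vectors moves through memory exactly once. The function name `csr_spmv_intensity` and its parameters are illustrative. The 27 nonzeros per row correspond to the 27-point stencil that HPCG uses for interior grid points.

```python
def csr_spmv_intensity(nnz_per_row, value_bytes=8, index_bytes=4):
    """Estimate flops per byte for y = A @ x with A stored in CSR.

    Assumes perfect reuse of x in cache, so x and y each contribute
    one value per row to the memory traffic.
    """
    # One multiply and one add per stored nonzero.
    flops = 2 * nnz_per_row
    # Per row: the nonzero values, their column indices, one row pointer.
    matrix_bytes = nnz_per_row * (value_bytes + index_bytes) + index_bytes
    # Per row: read one entry of x, write one entry of y.
    vector_bytes = 2 * value_bytes
    return flops / (matrix_bytes + vector_bytes)


# Interior row of a 27-point stencil (hypothetical storage sizes as above).
print(round(csr_spmv_intensity(27), 3))
```

Under these assumptions the estimate is about 0.157 flops/byte, well below one flop per byte, which is consistent with the kernels being limited by memory traffic rather than arithmetic.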


Optimizing Multi-grid Computation and Parallelization on Multi-cores

Yang, Fan, et al. 2023

Proceedings of the 37th International Conference on Supercomputing


Multigrid algorithms are widely used to solve large-scale sparse linear systems, which is essential for many high-performance workloads. The symmetric Gauss-Seidel (SYMGS) method is often responsible for the performance bottleneck of MG. This paper presents new methods to parallelize and enhance the computation and parallelization efficiency of the SYMGS and MG algorithms on multi-core CPUs. Our solution employs a matrix splitting strategy and a revised computation formula to decrease the computation operations and memory accesses in SYMGS. With this new SYMGS strategy, we can then merge the two most time-consuming components of MG. On top of these, we propose a new asynchronous parallelization scheme to reduce the synchronization overhead when parallelizing SYMGS. We demonstrate the benefit of our techniques by integrating them with the HPCG benchmark and two real-life applications. Evaluation conducted on four architectures, including three ARMv8 and one x86, shows that our techniques greatly surpass the performance of engineer- and vendor-tuned implementations across various workloads and platforms.

CCS CONCEPTS • Mathematics of computing → Solvers; Mathematical software performance; • Computing methodologies → Massively parallel algorithms.
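For readers unfamiliar with SYMGS, the following is a minimal textbook sketch of one symmetric Gauss-Seidel sweep on a CSR matrix. It is the standard baseline formulation, not the matrix-splitting or asynchronous scheme the abstract describes. The function name `symgs` and the plain-list CSR layout (`row_ptr`, `col_idx`, `vals`) are assumptions made for illustration.

```python
def symgs(row_ptr, col_idx, vals, b, x):
    """One symmetric Gauss-Seidel sweep on a CSR matrix, updating x in place.

    A forward pass over rows 0..n-1 is followed by a backward pass
    over rows n-1..0. Assumes every row stores a nonzero diagonal.
    """
    n = len(b)

    def relax(i):
        diag = 0.0
        acc = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]
            else:
                # Uses the newest x[j], including values updated
                # earlier in this same pass (loop-carried dependency).
                acc -= vals[k] * x[j]
        x[i] = acc / diag

    for i in range(n):            # forward sweep
        relax(i)
    for i in reversed(range(n)):  # backward sweep
        relax(i)
    return x
```

Each row update reads values written earlier in the same pass, so rows cannot simply be processed independently. That dependency is the reason parallelizing SYMGS requires extra synchronization or reordering.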


“…It turns out such prohibition limits the performance drastically. It is reported that implementing the HPCG computation in a matrix-free form significantly improves the performance, by 4.67× on the New Sunway supercomputer [Zhu et al 2021]. When possible, HPC researchers still seek matrix-free approaches even for implicit approaches, e.g., the 2016 Gordon Bell Prize winner [Yang et al 2016] designed and manually implemented a geometry-based pipelined ILU method that maps the data dependency to hardware-supported inter-core communication, which is a case for further optimizing with the sparsity pattern in hand.…”

Section: Solving Differential Equations on Structured Grids (mentioning)

confidence: 99%

Programming Matrices as Staged Sparse Rows to Generate Efficient Matrix-free Differential Equation Solver

Cao, Tang, Yu, et al. 2022

Preprint


Solving differential equations is a critical task in scientific computing. Domain-specific languages (DSLs) have been a promising direction in achieving performance and productivity, but the current state of the art only supports stencil computation, leaving solvers requiring loop-carried dependencies aside. Alternatively, sparse matrices can represent such equation solvers and are more general than existing DSLs, but the performance is sacrificed.

This paper points out that sparse matrices can be represented as programs instead of data, having both the generality from the matrix-based representation and the performance from program optimizations. Based on the idea, we propose the Staged Sparse Row (SSR) sparse matrix representation that can efficiently cover applications on structured grids. With SSR representation, users can intuitively define SSR matrices using generator functions and use SSR matrices through a concise object-oriented interface. SSR matrices can then be chained and applied to construct the algorithm, including those with loop-carried dependences. We then apply a set of dedicated optimizations, and ultimately simplify the SSR matrix-based codes into straightforward matrix-free ones, which are efficient and friendly for further analysis.

Implementing BT pseudo application in the NAS Parallel Benchmark, with less than 10% lines of code compared with the matrix-free reference FORTRAN implementation, we achieved up to 92.8% performance. Implementing a matrix-free variant for the High-Performance Conjugate Gradient benchmark, we achieve 3.29× performance compared with the reference implementation, while our implementation shares the same algorithm on the same programming abstraction, which is sparse matrices.
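The distinction between matrix-based and matrix-free code can be illustrated with a small example. The sketch below is not the SSR representation from the abstract; it uses a hypothetical 1D Laplacian (stencil 2, -1, -1 with zero boundaries) and illustrative function names to show the same operator applied once from stored CSR arrays and once directly from the stencil.

```python
def build_laplacian_csr(n):
    """Store the 1D Laplacian (2 on the diagonal, -1 off it) in CSR arrays."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def apply_stored(row_ptr, col_idx, vals, x):
    """Matrix-based: read coefficients and column indices from memory."""
    return [
        sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def apply_matrix_free(x):
    """Matrix-free: coefficients are constants in the code, nothing is stored."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y
```

Both functions compute the same result, but the matrix-free version moves only the vector through memory, while the stored version also streams every coefficient and index.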


“…Efforts have been made to optimize SpMV by optimizing the sparse matrix storage format [5], [9]-[11] and the computation kernel [12], [13]. These prior optimizations have primarily focused on the computation of a single, isolated SpMV invocation.…”

Section: Introduction (mentioning)

confidence: 99%

Memory-aware Optimization for Sequences of Sparse Matrix-Vector Multiplications

Zhang, Fan, et al. 2023

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)


This paper presents a novel approach to optimize multiple invocations of a sparse matrix-vector multiplication (SpMV) kernel performed on the same sparse matrix $A$ and dense vector $x$, like $Ax, A^2x, \dots, A^kx$, and their linear combinations such as $Ax + A^2x$. Such computations are frequently used in scientific applications for solving linear equations and in multigrid methods. Existing SpMV optimization techniques typically focus on a single SpMV invocation and do not consider opportunities for optimization across a sequence of SpMV operations (SSpMV), leaving much room for performance improvement. Our work aims to bridge this performance gap. It achieves this by partitioning the sparse matrix into submatrices and devising a new computation pipeline that reduces memory access to the sparse matrix and exploits the data locality of the dense vector of SpMV. Additionally, we demonstrate how our approach can be integrated with parallelization schemes to further improve performance. We evaluate our approach on four distinct multicore systems, including three ARM and one Intel platform. Experimental results show that our techniques improve the standard implementation and the highly-optimized Intel math kernel library (MKL) by a large margin.
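As a point of reference, the sequence in the abstract can be computed by calling a standard SpMV kernel k times in a row. The sketch below shows only this unoptimized baseline, not the partitioned pipeline the paper proposes; the function names and the plain-list CSR layout are assumptions for illustration.

```python
def spmv(row_ptr, col_idx, vals, x):
    """Standard CSR sparse matrix-vector product y = A @ x."""
    return [
        sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def power_sequence(row_ptr, col_idx, vals, x, k):
    """Return [A x, A^2 x, ..., A^k x] using k back-to-back SpMV calls.

    Every call traverses the whole matrix again, so the matrix
    arrays are read from memory k times.
    """
    out = []
    v = x
    for _ in range(k):
        v = spmv(row_ptr, col_idx, vals, v)
        out.append(v)
    return out


# Hypothetical 2x2 example: A = [[2, 0], [1, 3]], x = [1, 1].
seq = power_sequence([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0], 2)
# A linear combination such as A x + A^2 x is then an elementwise sum.
combo = [a + b for a, b in zip(seq[0], seq[1])]
```

Because the matrix is re-read on every call, this baseline repeats the same memory traffic k times, which is the redundancy the abstract says single-invocation optimizations leave unaddressed.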
