Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Hou, Kaixi; Feng, Wu-chun; Che, Shuai

doi:10.1109/ipdpsw.2017.155

Cited by 27 publications

(15 citation statements)

References 33 publications

“…A great deal of software solutions has been published on accelerating sparse algebra kernels, mostly for SpMV [4], [9]- [11], [24], [25], [29], [36], [50], [53], [63], [64], but also for SpMM [35], [67], [70]. Most of these works are based on format and data transformations, where block-based sparse matrix representations have received most attention for two main reasons: 1) sparse matrices in real applications generally have a block sub-structure, and 2) on-chip memory requests may be decreased when using block relative indices instead of directly using row/column ones.…”

Section: Related Work (mentioning)

confidence: 99%

VIA: A Smart Scratchpad for Vector Units with Application to Sparse Matrix Computations

Pavón, Valdivieso, Barredo et al. 2021

2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

“…Hou et al [81] proposed an auto-tuning framework for AMD APU platforms to find appropriate binning scheme and select appropriate kernel for each bin. The process of grouping rows with similar number of nonzeros together is referred to as binning by the authors.…”

Section: Literature Survey (mentioning)

confidence: 99%

SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

et al. 2019

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to "speed" in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule the segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism, available in Nvidia GPUs, is utilized to execute the group containing segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 matrices with high npr variance from 13 diverse domains. SURAA outperforms the other tools by delivering 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
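The row grouping described above (called binning in the statement about Hou et al., and segmenting in the SURAA abstract) can be illustrated with a small sketch. It is not the code of either paper: it assumes a CSR row-pointer array as input, and the function names `row_nnz`, `freedman_diaconis_width`, and `bin_rows_by_npr` are hypothetical. It only shows how the Freedman–Diaconis bin width, h = 2 · IQR / n^(1/3), could be used to group rows with similar nonzero counts.

```python
import math


def row_nnz(row_ptr):
    """Nonzeros per row (npr), derived from a CSR row-pointer array."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def freedman_diaconis_width(values):
    """Freedman-Diaconis bin width: h = 2 * IQR / n^(1/3)."""
    s = sorted(values)
    iqr = quantile(s, 0.75) - quantile(s, 0.25)
    return 2.0 * iqr / (len(s) ** (1.0 / 3.0))


def bin_rows_by_npr(row_ptr):
    """Group row indices into bins of similar npr.

    Returns a dict mapping bin index -> list of row indices, with rows
    visited in order of increasing npr. This is an illustrative grouping,
    not the segment layout used by any particular tool.
    """
    npr = row_nnz(row_ptr)
    width = freedman_diaconis_width(npr)
    if width <= 0:
        # All rows have (nearly) the same npr: one bin is enough.
        return {0: list(range(len(npr)))}
    lowest = min(npr)
    bins = {}
    for row in sorted(range(len(npr)), key=lambda r: npr[r]):
        bins.setdefault(int((npr[row] - lowest) // width), []).append(row)
    return bins


if __name__ == "__main__":
    # Hypothetical 8-row matrix with one long row (row 3 has 8 nonzeros).
    print(bin_rows_by_npr([0, 1, 3, 4, 12, 14, 15, 16, 17]))
```

In a scheme of this kind, each bin could then be dispatched to a kernel suited to its row length, for example one thread per row for short rows and a cooperative group of threads per row for long rows. That kernel choice is outside the scope of this sketch.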

“…It was shown that this approach appears to be the least efficient [14,15]. This follows from the overhead due to the sparse matrix format, from non-regular memory access, from a very low flop-to-byte ratio [21,22], and from problems concerning load imbalance [23]. Since SpMV is a memory-bound procedure, performance optimizations do not overcome the issue of considerable memory consumption.…”

Section: Ritz-Galerkin Formulation (mentioning)

confidence: 99%

Tensor B-Spline Numerical Methods for PDEs: a High-Performance Alternative to FEM

Shulga, Morozov, Roth et al. 2019

Preprint

Tensor B-spline methods are a high-performance alternative for solving partial differential equations (PDEs). This paper gives an overview of the principles of Tensor B-spline methodology, shows their use, analyzes their performance in application examples, and discusses their merits. Tensors preserve the dimensional structure of a discretized PDE, which makes it possible to develop highly efficient computational solvers. B-splines provide high-quality approximations, lead to a sparse structure of the system operator represented by shift-invariant separable kernels in the domain, and are mesh-free by construction. Further, high-order bases can easily be constructed from B-splines. In order to demonstrate the advantageous numerical performance of tensor B-spline methods, we studied the solution of a large-scale heat-equation problem (consisting of roughly 0.8 billion nodes!) on a heterogeneous workstation consisting of a multi-core CPU and GPUs. Our experimental results nicely confirm the excellent numerical approximation properties of tensor B-splines, and their unique combination of high computational efficiency and low memory consumption, thereby showing huge improvements over standard finite-element methods (FEM).
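The "very low flop-to-byte ratio" of SpMV mentioned in the citation statement for this entry can be made concrete with a rough estimate. The sketch below is not taken from either paper. It assumes CSR storage, 8-byte values, 4-byte indices, and that the input and output vectors each move through memory exactly once, which is optimistic for irregular access patterns. The function name `csr_spmv_intensity` is hypothetical.

```python
def csr_spmv_intensity(n_rows, n_cols, nnz, value_bytes=8, index_bytes=4):
    """Estimate flops per byte for y = A @ x with A stored in CSR.

    Assumptions (illustrative only): one multiply and one add per nonzero,
    every matrix entry and index read once, x read once, y written once.
    """
    flops = 2 * nnz
    # Values + column indices per nonzero, plus the row-pointer array.
    matrix_bytes = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
    # Input vector x (n_cols entries) and output vector y (n_rows entries).
    vector_bytes = (n_cols + n_rows) * value_bytes
    return flops / (matrix_bytes + vector_bytes)


if __name__ == "__main__":
    # Hypothetical matrix: 1000 x 1000 with 10 nonzeros per row on average.
    print(round(csr_spmv_intensity(1000, 1000, 10_000), 3))
```

Under these assumptions the ratio stays below 2 / 12 ≈ 0.17 flops per byte regardless of matrix size, because each nonzero contributes 2 flops but at least 12 bytes of matrix traffic. This is why SpMV is generally treated as memory-bound.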
