2013
DOI: 10.1007/978-3-642-39640-3_15
Optimization of Sparse Matrix-Vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs

Cited by 9 publications (7 citation statements)
References 8 publications
“…We also acknowledge Dr. Giles for having let us know that their code in [87] is now part of NVIDIA's cuSPARSE library and Dr. Mukunoki for having provided us their code in [79]. [Per-matrix results table (cop20k_A, FEM_3D_thermal2, thermal2, thermomech_TK, nlpkkt80, pde100, webbase-1M, dc1, amazon0302, roadNet-CA) garbled in extraction; column structure not recoverable.]…”
Section: Discussion
Confidence: 99%
“…Mukunoki et al [79] also proposed optimization techniques for the CSR format but on the more recent NVIDIA Kepler architecture, taking advantage of three new features: 48KB read-only data cache, shuffle instructions, and expanding the number of thread blocks in the x-direction that can be defined in a grid.…”
Section: CSR Optimizations
Confidence: 99%
“…To improve coalescing of matrix data accesses for matrices in the CRS representation, CUSP can virtually divide each warp into 2, 4, 8 or 16 smaller parts and assign them to different rows. Mukunoki and Takahashi used the same idea to optimize their CRS kernel for the Kepler GPU architecture [18]. Baskaran and Bordawekar [4] proposed a few other optimization techniques based on exploiting synchronization-free parallelism and optimized off-chip memory access.…”
Section: 3
Confidence: 99%