2013
DOI: 10.1007/978-3-642-39640-3_15
Optimization of Sparse Matrix-Vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs

Cited by 9 publications (7 citation statements)
References 8 publications
“…We also acknowledge Dr. Giles for having let us know that their code in [87] is now part of NVIDIA's cuSPARSE library and Dr. Mukunoki for having provided us their code in [79]. [Per-matrix results table (cop20k_A, FEM_3D_thermal2, thermal2, thermomech_TK, nlpkkt80, pde100, webbase-1M, dc1, amazon0302, roadNet-CA) garbled in extraction; column structure not recoverable.]…”
Section: Discussion
Confidence: 99%
“…Mukunoki et al [79] also proposed optimization techniques for the CSR format but on the more recent NVIDIA Kepler architecture, taking advantage of three new features: 48KB read-only data cache, shuffle instructions, and expanding the number of thread blocks in the x-direction that can be defined in a grid.…”
Section: CSR Optimizations
Confidence: 99%
“…To improve coalescing of matrix data accesses for matrices in the CRS representation, CUSP can virtually divide each warp into 2, 4, 8 or 16 smaller parts and assign them to different rows. Mukunoki and Takahashi used the same idea to optimize their CRS kernel for the Kepler GPU architecture [18]. Baskaran and Bordawekar [4] proposed a few other optimization techniques based on exploiting synchronization-free parallelism and optimized off-chip memory access.…”
Section: 3
Confidence: 99%