2020
DOI: 10.1145/3380930

Load-balancing Sparse Matrix Vector Product Kernels on GPUs

Abstract: Efficient processing of irregular matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for irregular matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of desi…
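The padding/divergence trade-off mentioned in the abstract is easiest to see in a CSR-to-ELL conversion. The host-side sketch below is illustrative only (the struct and function names are not from the paper): ELL pads every row to the length of the longest row, which removes per-row loop divergence in a one-thread-per-row SpMV kernel but stores explicit fill for short rows.

```cpp
// Illustrative host-side sketch (not the paper's code): converting CSR to ELL.
// ELL stores rows * max_row_nnz entries, so the padding overhead grows with the
// gap between the longest row and the average row length.
#include <algorithm>
#include <vector>

struct EllMatrix {
    int rows = 0;
    int max_row_nnz = 0;
    std::vector<int> col_idx;     // size rows * max_row_nnz, padded with -1
    std::vector<double> values;   // size rows * max_row_nnz, padded with 0.0
};

EllMatrix csr_to_ell(int rows, const std::vector<int>& row_ptr,
                     const std::vector<int>& cols,
                     const std::vector<double>& vals) {
    EllMatrix ell;
    ell.rows = rows;
    for (int r = 0; r < rows; ++r)
        ell.max_row_nnz = std::max(ell.max_row_nnz, row_ptr[r + 1] - row_ptr[r]);
    ell.col_idx.assign(static_cast<size_t>(rows) * ell.max_row_nnz, -1);
    ell.values.assign(static_cast<size_t>(rows) * ell.max_row_nnz, 0.0);
    for (int r = 0; r < rows; ++r) {
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k) {
            int slot = k - row_ptr[r];
            // Column-major layout: with one thread per row, consecutive threads
            // touch consecutive memory locations (coalesced accesses on a GPU).
            ell.col_idx[static_cast<size_t>(slot) * rows + r] = cols[k];
            ell.values[static_cast<size_t>(slot) * rows + r] = vals[k];
        }
    }
    return ell;
}
```

A single dense row inflates the stored matrix to rows × max_row_nnz entries, which is exactly the artificial overhead the abstract weighs against thread divergence.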

Cited by 33 publications (33 citation statements)
References 15 publications
“…Given the different hardware characteristics, see Table 1, we optimize kernel parameters like group size for the distinct architectures. More relevant, for the CSR, ELL, and HYB kernels, we modify the SpMV execution strategy for the AMD architecture from the strategy that was previously realized for NVIDIA architectures [2].…”
Section: Sparse Matrix Vector Kernel Designs
confidence: 99%
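The statement above only names the tuning knob, so the snippet below is a purely hypothetical sketch of an architecture-dependent group-size choice. The macro test reflects real HIP/CUDA compilation paths and the warp/wavefront widths are hardware facts, but the concrete group sizes are assumed values, not those used in the cited kernels.

```cpp
// Hypothetical compile-time selection of the SpMV "group size" parameter.
// The group-size values below are placeholders for illustration only.
#if defined(__HIP_PLATFORM_AMD__)
constexpr int device_warp_size = 64;  // AMD wavefront width
constexpr int spmv_group_size  = 16;  // assumed tuning value for AMD GPUs
#else
constexpr int device_warp_size = 32;  // NVIDIA warp width
constexpr int spmv_group_size  = 8;   // assumed tuning value for NVIDIA GPUs
#endif
```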
“…In Algorithm 2, we assign a "subwarp" (multiple threads) to each row, and use warp reduction mechanisms to accumulate the partial results before writing to the output vector. This classical CSR assigning multiple threads to each row is inspired by the performance improvement of the ELL SpMV in [2]. We adjust the number of threads assigned to each row to the maximum number of nonzeros in a row.…”
Section: CSR SpMV Kernel
confidence: 99%
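The subwarp-per-row strategy described in the statement above can be sketched as a CUDA kernel. This is a hedged illustration, not the cited implementation: the kernel name, template parameter, and launch configuration are assumptions. SUBWARP threads cooperate on one row, stride over its nonzeros, and combine their partial sums with warp shuffles before lane 0 writes the result.

```cpp
// Illustrative CUDA sketch of a subwarp-per-row CSR SpMV (names are assumed).
// SUBWARP must be a power of two no larger than the warp size (32).
template <int SUBWARP>
__global__ void csr_spmv_subwarp(int rows, const int* __restrict__ row_ptr,
                                 const int* __restrict__ col_idx,
                                 const double* __restrict__ vals,
                                 const double* __restrict__ x,
                                 double* __restrict__ y) {
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int row  = tid / SUBWARP;   // one subwarp handles one row
    const int lane = tid % SUBWARP;   // position within the subwarp

    double sum = 0.0;
    if (row < rows) {
        // Each lane strides over the nonzeros of its row.
        for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += SUBWARP)
            sum += vals[k] * x[col_idx[k]];
    }
    // Reduce the partial sums within each SUBWARP-wide segment of the warp.
    for (int offset = SUBWARP / 2; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffffu, sum, offset, SUBWARP);

    if (row < rows && lane == 0)
        y[row] = sum;
}

// Hypothetical launch: 4 threads per row, 256-thread blocks.
// csr_spmv_subwarp<4><<<(rows * 4 + 255) / 256, 256>>>(rows, d_row_ptr,
//                                                      d_col_idx, d_vals,
//                                                      d_x, d_y);
```

Per the statement, the number of threads assigned to each row would be derived from the maximum number of nonzeros in a row, e.g. the next power of two that covers it, capped at the warp width.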