Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming 2019
DOI: 10.1145/3293883.3295712

Adaptive sparse tiling for sparse matrix multiplication

Abstract: Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern of sparse matrix multiplication makes it challenging to use tiling to enhance data reuse. In this paper, we devise an adaptive tiling strategy and apply it to enhance the performance of two primitives: SpMM (product of sparse matrix and dense matrix) and SDDMM (sampled dense-dense matrix multiplication). …
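To make the two primitives concrete, here is a minimal reference sketch of their semantics using SciPy. It only illustrates what SpMM and SDDMM compute; the function names and shapes are our assumptions, and this is not the paper's tiled GPU implementation.

```python
import numpy as np
import scipy.sparse as sp

def spmm(A, B):
    """SpMM: sparse A (M x K, CSR) times dense B (K x N) -> dense (M x N)."""
    return A @ B  # SciPy dispatches to its built-in CSR * dense kernel

def sddmm(S, P, Q):
    """SDDMM: sampled dense-dense matrix multiplication.

    Computes the dense product P @ Q.T (P is M x K, Q is N x K) only at
    the nonzero positions of the sparse sampling matrix S (M x N),
    scaling each sample by S's value:
        out[i, j] = S[i, j] * dot(P[i, :], Q[j, :])   for S[i, j] != 0.
    """
    S = S.tocoo()  # COO keeps the (row, col, data) triplets aligned
    vals = S.data * np.einsum('ij,ij->i', P[S.row], Q[S.col])
    return sp.csr_matrix((vals, (S.row, S.col)), shape=S.shape)

# Tiny smoke test.
A = sp.random(8, 6, density=0.3, format='csr', random_state=0)
B = np.random.rand(6, 4)
C = spmm(A, B)                                    # dense 8 x 4
D = sddmm(A, np.random.rand(8, 5), np.random.rand(6, 5))  # sparse 8 x 6
```

Both primitives stream a dense matrix through the nonzero structure of a sparse one, which is exactly where the paper's adaptive tiling aims to improve data reuse.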

Cited by 102 publications (78 citation statements).
References 44 publications (37 reference statements).
“…Third, it introduces load imbalance, as different kernels are likely to have different workloads. In fact, the throughput of cuBLAS's single-precision GEMM can be up to 8,000 GFLOPS [2] on an Nvidia P100 GPU, whereas the throughput of the state-of-the-art implementation of single-precision SpMM is about 800 GFLOPS on the same device [23,25]. Because of this huge throughput gap, it is a major challenge to improve the performance of CNN inference on GPUs while retaining its accuracy.…”
Section: Performance Challenges with CNN Pruning
confidence: 99%
“…As explained in the background section, sparse convolutions can be implemented as SpMM. Although previous works have studied SpMM on GPUs [22,23,25], their optimization techniques mainly target large sparse matrices, with at least 10,000 rows and columns, that arise in scientific computing applications, and they cannot deliver good performance for sparse convolutions, where the number of convolution kernels is usually smaller than 1,000. In fact, we adopted a state-of-the-art implementation of SpMM from [25] for sparse convolution with real-world pruned models from [46], and we found that the sparse convolutions do not run much faster (and can even be slower) than the original dense convolutions implemented as GEMM.…”
Section: Implementing Sparse Convolutions With GEMM
confidence: 99%
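The passage above notes that a sparse (pruned) convolution lowers to SpMM. Below is a minimal sketch of that lowering via im2col; the helper names, the CSR layout of the filter bank, and the stride-1/no-padding setup are illustrative assumptions, not the quoted paper's kernels.

```python
import numpy as np
import scipy.sparse as sp

def im2col(x, kh, kw):
    # x: (C, H, W) input; stride 1, no padding (illustrative assumption).
    # Returns a (C*kh*kw, Ho*Wo) matrix whose columns are flattened patches;
    # row order is (c, i, j), matching the filter flattening below.
    C, H, W = x.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, Ho * Wo), dtype=x.dtype)
    r = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[r] = x[c, i:i + Ho, j:j + Wo].reshape(-1)
                r += 1
    return cols

def sparse_conv2d(w_sparse, x, kh, kw):
    # w_sparse: (F, C*kh*kw) pruned filter bank in CSR, each row a filter
    # flattened in (c, i, j) order. The whole convolution is one SpMM:
    # sparse (F x CKK) @ dense (CKK x Ho*Wo).
    cols = im2col(x, kh, kw)
    out = np.asarray(w_sparse @ cols)  # SpMM
    _, H, W = x.shape
    return out.reshape(-1, H - kh + 1, W - kw + 1)

# Example: 64 filters (F << 1,000, as in the quoted passage), a 3x32x32
# input, and 90% of the weights pruned away.
F, C, kh, kw = 64, 3, 3, 3
w = sp.random(F, C * kh * kw, density=0.1, format='csr', random_state=0)
y = sparse_conv2d(w, np.random.rand(C, 32, 32), kh, kw)
print(y.shape)  # (64, 30, 30)
```

Note that the sparse operand here has only F rows, typically well under 1,000 for a CNN layer, so SpMM kernels tuned for the very large matrices of scientific computing have little row-level parallelism to exploit, consistent with the observation in the quoted passage.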