Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming 2019
DOI: 10.1145/3293883.3295712

Adaptive sparse tiling for sparse matrix multiplication

Abstract: Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern of sparse matrix multiplication makes it challenging to use tiling to enhance data reuse. In this paper, we devise an adaptive tiling strategy and apply it to enhance the performance of two primitives: SpMM (product of sparse matrix and dense matrix) and SDDMM (sampled dense-dense matrix multiplication). …
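To make the two primitives concrete, here is a minimal reference sketch of their semantics using SciPy. It only illustrates what SpMM and SDDMM compute; the function names and shapes are our assumptions, and this is not the paper's tiled GPU implementation.

```python
import numpy as np
import scipy.sparse as sp

def spmm(A, B):
    """SpMM: sparse A (M x K, CSR) times dense B (K x N) -> dense (M x N)."""
    return A @ B  # SciPy dispatches to its built-in CSR * dense kernel

def sddmm(S, P, Q):
    """SDDMM: sampled dense-dense matrix multiplication.

    Computes the dense product P @ Q.T (P is M x K, Q is N x K) only at
    the nonzero positions of the sparse sampling matrix S (M x N),
    scaling each sample by S's value:
        out[i, j] = S[i, j] * dot(P[i, :], Q[j, :])   for S[i, j] != 0.
    """
    S = S.tocoo()  # COO keeps the (row, col, data) triplets aligned
    vals = S.data * np.einsum('ij,ij->i', P[S.row], Q[S.col])
    return sp.csr_matrix((vals, (S.row, S.col)), shape=S.shape)

# Tiny smoke test.
A = sp.random(8, 6, density=0.3, format='csr', random_state=0)
B = np.random.rand(6, 4)
C = spmm(A, B)                                    # dense 8 x 4
D = sddmm(A, np.random.rand(8, 5), np.random.rand(6, 5))  # sparse 8 x 6
```

Both primitives stream a dense matrix through the nonzero structure of a sparse one, which is exactly where the paper's adaptive tiling aims to improve data reuse.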

Cited by 102 publications (78 citation statements).
References 44 publications (37 reference statements).
“…Third, it introduces load imbalance, as different kernels are likely to have different workloads. In fact, the throughput of cuBLAS's single-precision GEMM can be up to 8,000 GFLOPS [2] on an Nvidia P100 GPU, whereas the throughput of the state-of-the-art implementation of single-precision SpMM is about 800 GFLOPS on the same device [23,25]. Because of this huge throughput gap, it is a major challenge to improve the performance of CNN inference on GPUs while retaining its accuracy.…”
Section: Performance Challenges with CNN Pruning
confidence: 99%
“…As explained in the background section, sparse convolutions can be implemented as SpMM. Although previous works have studied SpMM on GPUs [22,23,25], their optimization techniques mainly target large sparse matrices, with at least 10,000 rows and columns, that arise in scientific computing applications, and they cannot deliver good performance for sparse convolutions, where the number of convolution kernels is usually smaller than 1,000. In fact, we adopted a state-of-the-art implementation of SpMM from [25] for sparse convolution with real-world pruned models from [46], and we found that the sparse convolutions do not run much faster (and can even be slower) than the original dense convolutions implemented as GEMM.…”
Section: Implementing Sparse Convolutions With GEMM
confidence: 99%
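The passage above notes that a sparse (pruned) convolution lowers to SpMM. Below is a minimal sketch of that lowering via im2col; the helper names, the CSR layout of the filter bank, and the stride-1/no-padding setup are illustrative assumptions, not the quoted paper's kernels.

```python
import numpy as np
import scipy.sparse as sp

def im2col(x, kh, kw):
    # x: (C, H, W) input; stride 1, no padding (illustrative assumption).
    # Returns a (C*kh*kw, Ho*Wo) matrix whose columns are flattened patches;
    # row order is (c, i, j), matching the filter flattening below.
    C, H, W = x.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, Ho * Wo), dtype=x.dtype)
    r = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[r] = x[c, i:i + Ho, j:j + Wo].reshape(-1)
                r += 1
    return cols

def sparse_conv2d(w_sparse, x, kh, kw):
    # w_sparse: (F, C*kh*kw) pruned filter bank in CSR, each row a filter
    # flattened in (c, i, j) order. The whole convolution is one SpMM:
    # sparse (F x CKK) @ dense (CKK x Ho*Wo).
    cols = im2col(x, kh, kw)
    out = np.asarray(w_sparse @ cols)  # SpMM
    _, H, W = x.shape
    return out.reshape(-1, H - kh + 1, W - kw + 1)

# Example: 64 filters (F << 1,000, as in the quoted passage), a 3x32x32
# input, and 90% of the weights pruned away.
F, C, kh, kw = 64, 3, 3, 3
w = sp.random(F, C * kh * kw, density=0.1, format='csr', random_state=0)
y = sparse_conv2d(w, np.random.rand(C, 32, 32), kh, kw)
print(y.shape)  # (64, 30, 30)
```

Note that the sparse operand here has only F rows, typically well under 1,000 for a CNN layer, so SpMM kernels tuned for the very large matrices of scientific computing have little row-level parallelism to exploit, consistent with the observation in the quoted passage.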