2021
DOI: 10.48550/arxiv.2104.08378
Preprint

Accelerating Sparse Deep Neural Networks

Abstract: As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity: encouraging zero values in parameters that can then be discarded from storage or computations. While most research focuses on high levels of sparsity, there are challenges in universally maintaining model accuracy as well as achieving significant speedups over modern matrix-math hardware. To ma…
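
The sparsity the abstract describes is most commonly obtained by magnitude pruning: setting the smallest-magnitude parameters to zero so they can be skipped in storage and computation. As a minimal, hedged sketch of that general idea (not the paper's specific training recipe), the NumPy snippet below zeroes a chosen fraction of a weight tensor:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude entries so that roughly `sparsity`
    of the elements become zero (illustrative only)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.randn(4, 8).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(float((w_sparse == 0).mean()))   # roughly 0.5 of the weights are now zero
```

Unstructured pruning like this saves storage, but as the abstract notes, turning the zeros into wall-clock speedups on matrix-math hardware generally requires a structured pattern such as the 2:4 scheme discussed in the citation statements below.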

Cited by 21 publications (56 citation statements). References 23 publications.
“…The recently introduced NVIDIA Ampere GPU architecture supports acceleration of sparse matrix multiplication with a specific pattern of 2:4 sparsity (2 of the 4 consecutive weight elements are zero, see Figure 3). This comes with a limitation of requiring the input and output dimensions of all linear projections to be divisible by 16 (Mishra et al., 2021). We ensure compatibility with such a pattern by structurally pruning matrices so that the remaining dimension is divisible by 16 (more details in Appendix A.2).…”
Section: Global Importance Ranking
confidence: 99%
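
To make the quoted 2:4 constraint concrete, here is a minimal NumPy sketch that keeps the two largest-magnitude values in every group of four consecutive weights and zeroes the other two. The function name and the magnitude-based selection are illustrative assumptions, not the cited paper's exact procedure:

```python
import numpy as np

def prune_2_to_4(weights: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude values in every group of 4 consecutive
    elements along the last dimension and zero the other 2."""
    rows, cols = weights.shape
    assert cols % 4 == 0, "last dimension must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4).copy()
    # Positions of the 2 smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

# Example with dimensions that are multiples of 16, matching the quoted
# divisibility requirement for linear projections.
w = np.random.randn(64, 128).astype(np.float32)
w24 = prune_2_to_4(w)
# Every group of 4 consecutive weights now holds at most 2 nonzeros.
assert (w24.reshape(64, -1, 4) != 0).sum(axis=-1).max() <= 2
```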
“…The model size-accuracy tradeoff also outperforms previous model compression methods like SViTE and AutoFormer by a large margin. Since our pruning scheme supports the utilization of Ampere sparsity on advanced GPU architectures, with the help of Apex ASP (Mishra et al., 2021), an additional 5% speedup can be achieved on our pruned models without further accuracy loss. We show for the first time that our pruning method can serve as an effective architecture search tool for ViT models, and, more interestingly, the inferred design rules are scalable to different model sizes.…”
Section: Pruning Analysis on ImageNet-1K
confidence: 99%
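
The Apex ASP tool mentioned in this statement is NVIDIA's Automatic SParsity utility for imposing the 2:4 pattern on trained PyTorch models. A hedged sketch of a typical invocation follows; `MyViT` is a hypothetical placeholder for the citing paper's model, and the exact ASP entry points may differ between Apex versions:

```python
import torch
from apex.contrib.sparsity import ASP  # NVIDIA Apex (assumed installed with CUDA extensions)

model = MyViT().cuda()                  # hypothetical model definition, not from the source
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Compute 2:4 masks for the trained weights and hook the optimizer so the
# masked weights stay zero during the recovery fine-tuning that follows.
ASP.prune_trained_model(model, optimizer)

# ...fine-tune for a few epochs, then deploy; the 2:4 pattern lets
# Sparse Tensor Cores run the pruned GEMMs at higher throughput.
```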
“…NVIDIA also recently introduced weight sparsity acceleration in its Ampere microarchitecture [17,19]. The Sparse TC (STC) hardware achieves 2× speedup over the original TC by essentially skipping 50% of the computations (Figure 5).…”
Section: Tensor Cores
confidence: 99%
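
The "2× by skipping 50% of the computations" claim can be illustrated without any hardware detail: store only the two nonzero values per group of four together with their positions (the metadata), and accumulate only those products against the dense operand. The NumPy sketch below is an assumption-level illustration of that bookkeeping, not the Sparse Tensor Core datapath:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2:4-sparse weight matrix: zero the 2 smallest-magnitude entries in
# every group of 4 consecutive columns.
g = rng.standard_normal((8, 4, 4)).astype(np.float32)
np.put_along_axis(g, np.argsort(np.abs(g), axis=-1)[..., :2], 0.0, axis=-1)
w24 = g.reshape(8, 16)

def sparse_matmul_2_to_4(w24: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a 2:4-sparse weight matrix by dense activations using only
    the 2 stored values per group of 4, i.e. half the multiply-accumulates
    of a dense matmul. Plain-NumPy illustration only."""
    rows, cols = w24.shape
    groups = w24.reshape(rows, cols // 4, 4)
    keep = np.argsort(np.abs(groups), axis=-1)[..., 2:]   # metadata: kept positions
    vals = np.take_along_axis(groups, keep, axis=-1)      # compressed nonzero values
    x_groups = x.reshape(cols // 4, 4, -1)                 # activations grouped by 4
    out = np.zeros((rows, x.shape[1]), dtype=np.float32)
    for gi in range(cols // 4):
        x_sel = x_groups[gi][keep[:, gi, :]]               # gather 2 activations per group
        out += np.einsum("rk,rkb->rb", vals[:, gi, :], x_sel)
    return out

x = rng.standard_normal((16, 4)).astype(np.float32)
assert np.allclose(sparse_matmul_2_to_4(w24, x), w24 @ x, atol=1e-4)
```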
“…Weight pruning is, arguably, the compression method with the richest history [35] and is currently a very active research topic [25]. Thanks to this trend, a set of fairly consistent accuracy benchmarks has emerged for pruning, along with increasingly efficient computational support [9,18,33,43].…”
Section: Introduction
confidence: 99%