2021
DOI: 10.48550/arxiv.2104.08378
Preprint

Accelerating Sparse Deep Neural Networks

Abstract: As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity: encouraging zero values in parameters that can then be discarded from storage or computations. While most research focuses on high levels of sparsity, there are challenges in universally maintaining model accuracy as well as achieving significant speedups over modern matrix-math hardware. To ma…
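
The sparsity the abstract describes is most commonly obtained by magnitude pruning: setting the smallest-magnitude parameters to zero so they can be skipped in storage and computation. As a minimal, hedged sketch of that general idea (not the paper's specific training recipe), the NumPy snippet below zeroes a chosen fraction of a weight tensor:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude entries so that roughly `sparsity`
    of the elements become zero (illustrative only)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.randn(4, 8).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(float((w_sparse == 0).mean()))   # roughly 0.5 of the weights are now zero
```

Unstructured pruning like this saves storage, but as the abstract notes, turning the zeros into wall-clock speedups on matrix-math hardware generally requires a structured pattern such as the 2:4 scheme discussed in the citation statements below.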

Cited by 21 publications (56 citation statements). References 23 publications.
“…The recently introduced NVIDIA Ampere GPU architecture supports acceleration of sparse matrix multiplication with a specific pattern of 2:4 sparsity (2 of the 4 consecutive weight elements are zero, see Figure 3). This comes with a limitation of requiring the input and output dimensions of all linear projections to be divisible by 16 (Mishra et al., 2021). We ensure compatibility with such a pattern by structurally pruning matrices so that the remaining dimension is divisible by 16 (more details in Appendix A.2).…”
Section: Global Importance Ranking
confidence: 99%
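
To make the quoted 2:4 constraint concrete, here is a minimal NumPy sketch that keeps the two largest-magnitude values in every group of four consecutive weights and zeroes the other two. The function name and the magnitude-based selection are illustrative assumptions, not the cited paper's exact procedure:

```python
import numpy as np

def prune_2_to_4(weights: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude values in every group of 4 consecutive
    elements along the last dimension and zero the other 2."""
    rows, cols = weights.shape
    assert cols % 4 == 0, "last dimension must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4).copy()
    # Positions of the 2 smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

# Example with dimensions that are multiples of 16, matching the quoted
# divisibility requirement for linear projections.
w = np.random.randn(64, 128).astype(np.float32)
w24 = prune_2_to_4(w)
# Every group of 4 consecutive weights now holds at most 2 nonzeros.
assert (w24.reshape(64, -1, 4) != 0).sum(axis=-1).max() <= 2
```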
“…The model size-accuracy tradeoff also outperforms previous model compression methods like SViTE and AutoFormer by a large margin. Since our pruning scheme supports the utilization of Ampere sparsity on advanced GPU architectures, with the help of Apex ASP (Mishra et al., 2021), an additional 5% speedup can be achieved on our pruned models without further accuracy loss. We show for the first time that our pruning method can serve as an effective architecture search tool for ViT models, and, more interestingly, the inferred design rules are scalable to different model sizes.…”
Section: Pruning Analysis on ImageNet-1K
confidence: 99%
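
The Apex ASP tool mentioned in this statement is NVIDIA's Automatic SParsity utility for imposing the 2:4 pattern on trained PyTorch models. A hedged sketch of a typical invocation follows; `MyViT` is a hypothetical placeholder for the citing paper's model, and the exact ASP entry points may differ between Apex versions:

```python
import torch
from apex.contrib.sparsity import ASP  # NVIDIA Apex (assumed installed with CUDA extensions)

model = MyViT().cuda()                  # hypothetical model definition, not from the source
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Compute 2:4 masks for the trained weights and hook the optimizer so the
# masked weights stay zero during the recovery fine-tuning that follows.
ASP.prune_trained_model(model, optimizer)

# ...fine-tune for a few epochs, then deploy; the 2:4 pattern lets
# Sparse Tensor Cores run the pruned GEMMs at higher throughput.
```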
“…NVIDIA also recently introduced weight sparsity acceleration in its Ampere microarchitecture [17,19]. The Sparse TC (STC) hardware achieves 2× speedup over the original TC by essentially skipping 50% of the computations (Figure 5).…”
Section: Tensor Cores
confidence: 99%
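
The "2× by skipping 50% of the computations" claim can be illustrated without any hardware detail: store only the two nonzero values per group of four together with their positions (the metadata), and accumulate only those products against the dense operand. The NumPy sketch below is an assumption-level illustration of that bookkeeping, not the Sparse Tensor Core datapath:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2:4-sparse weight matrix: zero the 2 smallest-magnitude entries in
# every group of 4 consecutive columns.
g = rng.standard_normal((8, 4, 4)).astype(np.float32)
np.put_along_axis(g, np.argsort(np.abs(g), axis=-1)[..., :2], 0.0, axis=-1)
w24 = g.reshape(8, 16)

def sparse_matmul_2_to_4(w24: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a 2:4-sparse weight matrix by dense activations using only
    the 2 stored values per group of 4, i.e. half the multiply-accumulates
    of a dense matmul. Plain-NumPy illustration only."""
    rows, cols = w24.shape
    groups = w24.reshape(rows, cols // 4, 4)
    keep = np.argsort(np.abs(groups), axis=-1)[..., 2:]   # metadata: kept positions
    vals = np.take_along_axis(groups, keep, axis=-1)      # compressed nonzero values
    x_groups = x.reshape(cols // 4, 4, -1)                 # activations grouped by 4
    out = np.zeros((rows, x.shape[1]), dtype=np.float32)
    for gi in range(cols // 4):
        x_sel = x_groups[gi][keep[:, gi, :]]               # gather 2 activations per group
        out += np.einsum("rk,rkb->rb", vals[:, gi, :], x_sel)
    return out

x = rng.standard_normal((16, 4)).astype(np.float32)
assert np.allclose(sparse_matmul_2_to_4(w24, x), w24 @ x, atol=1e-4)
```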
“…Weight pruning is, arguably, the compression method with the richest history [35] and is currently a very active research topic [25]. Thanks to this trend, a set of fairly consistent accuracy benchmarks has emerged for pruning, along with increasingly efficient computational support [9,18,33,43].…”
Section: Introduction
confidence: 99%