Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing 2022
DOI: 10.1145/3502181.3531463
Efficient Design Space Exploration for Sparse Mixed Precision Neural Architectures

Abstract: Pruning and Quantization are two effective Deep Neural Network (DNN) compression methods for efficient inference on various hardware platforms. Pruning refers to removing unimportant weights or nodes, whereas Quantization converts floating-point parameters to a low-bit fixed-point integer representation. The pruned and low-precision models are smaller and faster at inference on hardware platforms, with almost the same accuracy as the unoptimized network. Tensor Cores in the Nvidia Ampere 100 (A100) GPU suppor…
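The abstract describes two compression steps: pruning (zeroing unimportant weights) and quantization (replacing floating-point parameters with low-bit integers). As a rough, self-contained illustration only, and not the authors' pipeline or the A100's 2:4 sparse Tensor Core path, the sketch below applies PyTorch's built-in magnitude pruning and dynamic Int8 quantization to a toy model.

```python
# Illustrative sketch only: magnitude pruning followed by dynamic Int8
# quantization with stock PyTorch utilities. The toy model and the 50%
# sparsity level are arbitrary choices, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Pruning: zero out the 50% smallest-magnitude weights of each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# Quantization: convert the Linear layers to Int8 for inference.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(qmodel(torch.randn(1, 128)).shape)  # torch.Size([1, 10])
```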

Cited by 3 publications (4 citation statements) | References 32 publications
“…The search space pruning can also be extended to other benchmark models. Our previous work [28] has shown a similar trend where different precisions could exhibit similar latency. 2) Skip Connection: We did not consider the skip connection in our Architecture and Mixed Precision search space…”
Section: Future Work
confidence: 56%
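The quoted future-work note hinges on the observation that different precisions can show nearly identical latency, which is what makes search-space pruning worthwhile. A hypothetical sketch of that pruning rule, with made-up latency numbers rather than measurements from the paper:

```python
# Hypothetical search-space pruning: if a lower precision is no faster than a
# higher precision already kept (within a tolerance), drop it, since it cannot
# improve latency and may hurt accuracy. All numbers here are invented.
LATENCY_MS = {"fp16": 0.42, "int8": 0.31, "int4": 0.30}  # one example layer
PRECISION_ORDER = ["fp16", "int8", "int4"]               # high -> low precision

def prune_precisions(latency_ms, tol=0.02):
    kept = []
    for p in PRECISION_ORDER:
        # Skip this precision if an already-kept (higher) precision is just as fast.
        if any(abs(latency_ms[p] - latency_ms[k]) <= tol for k in kept):
            continue
        kept.append(p)
    return kept

print(prune_precisions(LATENCY_MS))  # ['fp16', 'int8'] -> int4 is pruned away
```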
“…The automatically searched mixed quantized networks offer a better latency-accuracy tradeoff than Uniform Quantization on the CIFAR [26] and ImageNet [27] datasets. The search method is partly taken from our previous work [28], which we applied to a different hardware platform (Nvidia A100 GPU) and Neural Network (ResNet50). However, we significantly contributed to developing a new search space pruning and weight/activation sharing method in this paper…”
Section: Limitations of the State-of-the-art (SOTA)
confidence: 99%
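This statement contrasts automatically searched mixed quantized networks with uniform quantization. A minimal sketch of what a per-layer mixed-precision search can look like under a latency budget; the layer names, latency table, and accuracy-loss proxy are invented stand-ins, not the paper's method or data:

```python
# Toy mixed-precision search: exhaustively try Int8/Int4 per layer, keep the
# assignment with the smallest accuracy-loss proxy that fits a latency budget.
# All numbers are fabricated for illustration.
import itertools

LAYERS = ["conv1", "conv2", "fc"]
LATENCY = {                      # hypothetical per-layer latency in ms
    "conv1": {"int8": 0.20, "int4": 0.12},
    "conv2": {"int8": 0.35, "int4": 0.22},
    "fc":    {"int8": 0.10, "int4": 0.07},
}
SENSITIVITY = {"conv1": 0.8, "conv2": 0.5, "fc": 0.1}  # proxy loss if set to int4

def best_mixed_config(budget_ms):
    best, best_loss = None, float("inf")
    for combo in itertools.product(["int8", "int4"], repeat=len(LAYERS)):
        lat = sum(LATENCY[l][p] for l, p in zip(LAYERS, combo))
        loss = sum(SENSITIVITY[l] for l, p in zip(LAYERS, combo) if p == "int4")
        if lat <= budget_ms and loss < best_loss:
            best, best_loss = dict(zip(LAYERS, combo)), loss
    return best

# Uniform Int8 (0.65 ms) misses a 0.60 ms budget; the search finds a mixed
# assignment that fits it with the smallest proxy loss.
print(best_mixed_config(budget_ms=0.60))  # {'conv1': 'int8', 'conv2': 'int4', 'fc': 'int8'}
```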
“…We develop the Mixed Sparse and Precision Search (MSPS) technique [Chitty-Venkata et al. (2022a)] to search for an efficient weight matrix (dense or sparse) and precision combination for every layer on a fixed pretrained model (Section 6.4). The automatically generated MSPS networks outperform Uniform 2:4 Sparse Int8 and 4 configured networks in terms of accuracy and latency on the CIFAR [Krizhevsky et al. (2009)] and ImageNet [Deng et al. (2009)] datasets…”
confidence: 99%
“…The automatically generated MSPS networks outperform Uniform 2:4 Sparse Int8 and 4 configured networks in terms of accuracy and latency on the CIFAR [Krizhevsky et al. (2009)] and ImageNet [Deng et al. (2009)] datasets. 3. We extend MSPS and develop a technique to search for Neural Architecture, Sparsity pattern, and Precision (ASPS) [Chitty-Venkata et al. (2022a)] to jointly optimize the macro-architecture (kernel size, number of filters) and the sparse-precision combination of each layer (Section 6.5). The resulting ASPS outperforms both the baseline Uniform Sparse Int8…”
confidence: 99%
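The two statements above outline MSPS (per-layer sparsity pattern and precision on a fixed pretrained model) and ASPS (the same choices plus kernel size and filter count). The option lists in the sketch below are assumptions picked for illustration; it only shows how quickly the joint space grows, which is why the search-space pruning discussed in these citations matters:

```python
# Back-of-the-envelope size of an ASPS-style joint search space. The candidate
# values are illustrative guesses, not the exact options used in the paper.
import itertools

KERNEL_SIZES = [3, 5]
NUM_FILTERS  = [32, 64]
SPARSITY     = ["dense", "2:4"]
PRECISION    = ["int8", "int4"]

per_layer = list(itertools.product(KERNEL_SIZES, NUM_FILTERS, SPARSITY, PRECISION))
num_layers = 10  # assumed depth
print(f"{len(per_layer)} options per layer, "
      f"{len(per_layer) ** num_layers:.3e} total network configurations")
```

Even with only 16 options per layer, a 10-layer model already has on the order of 10^12 configurations, so exhaustive enumeration is infeasible and pruning or sharing strategies are required.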