2021
DOI: 10.48550/arxiv.2105.11010
Preprint

Post-Training Sparsity-Aware Quantization

Abstract: Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Mapping FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, as accuracy degradation becomes noticeable, du…
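As a rough illustration of the uniform PTQ described in the abstract, a minimal sketch of mapping an FP32 weight tensor to INT8 with a single per-tensor scale might look like the following. This is a generic symmetric scheme for illustration only; the function names and the per-tensor, symmetric choice are assumptions and do not reproduce the paper's sparsity-aware method.

import numpy as np

def quantize_symmetric_int8(w_fp32):
    # Per-tensor symmetric scale: the largest absolute weight maps to 127.
    scale = float(np.abs(w_fp32).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; avoid division by zero
    q = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # FP32 approximation of the original weights; the gap is the rounding noise
    # that grows quickly once fewer than 8 bits are available.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_symmetric_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())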

Cited by 4 publications (3 citation statements)
References 28 publications
“…According to [54], the usage of low-precision fixed integer value representation has the potential to reduce the memory footprint and latency by a factor of 16×. Despite being fast and very easy to use, the post-training quantization approach is not an option because it suffers from significant degradation in model accuracy in case of precision lower than 8 bits [55].…”
Section: Model Quantization (mentioning)
Confidence: 99%
“…Model quantization is an optimization technique that aims at transforming the higher-bit level weights to lower-bit level weights, e.g., from float32 weights to 8-bit integer weights, to reduce the size of the model for an easy model deployment. Multiple quantization approaches [32,33,38,53] have been proposed given its importance in DL-based engineering. An important part of quantization methods is the mapping between the two parts of weights.…”
Section: Model Quantization (mentioning)
Confidence: 99%
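The "mapping between the two parts of weights" mentioned in the statement above is usually an affine (asymmetric) mapping defined by a scale and a zero-point. A hedged sketch of that variant follows; the names and the unsigned 8-bit range are assumptions for illustration, not taken from the cited works.

import numpy as np

def quantize_affine_uint8(x_fp32):
    # Affine mapping: the real value 0.0 lands exactly on an integer of the uint8 grid.
    x_min = min(float(x_fp32.min()), 0.0)
    x_max = max(float(x_fp32.max()), 0.0)
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant-zero input
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x_fp32 / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    # Map the integer grid back to an FP32 approximation.
    return (q.astype(np.float32) - zero_point) * scale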
“…To reduce the data exchange bandwidth and on-chip storage size requirements for CNN inference accelerators, one straightforward method is to compress the generated interlayer data simultaneously during inference. Many works reported quantizing the activation into low-precision fixed-point data to reduce the interlayer data size [11,25,26,54,55]. However, the lowprecision activation will significantly degrade the prediction accuracy of CNNs if the quantization is straightforwardly implemented on interlayer activations without retraining [27,28].…”
Section: Introduction (mentioning)
Confidence: 99%