2021
DOI: 10.3389/frai.2021.676564
Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Abstract: Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for u…

Cited by 29 publications (30 citation statements) · References 28 publications
“…On the other hand, math-intensive tensor operations executed on INT8 types can see up to a 16× speed-up compared to the same operations in FP32. Moreover, memory-limited operations could see up to a 4× speed-up compared to the FP32 version [22-24, 41]. Therefore, in addition to pruning, we will reduce the precision of the weights and activations to further decrease the computational complexity of the equalizer, employing the technique known as integer quantization [41].…”
Section: Quantization Technique
confidence: 99%
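The excerpt above refers to mapping FP32 weights and activations onto INT8 values. A minimal sketch of symmetric per-tensor INT8 quantization is shown below; the function and variable names are illustrative only and do not come from the cited works.

```python
# Symmetric per-tensor INT8 quantization: a single scale maps float32
# values onto the signed 8-bit range [-128, 127].
import numpy as np

def quantize_int8(x):
    """Quantize a float32 array to INT8 with one symmetric scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 array from INT8 values."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
w_q, s = quantize_int8(w)
w_hat = dequantize(w_q, s)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
```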
“…The quantization process can occur after training or during it. The first case is known as post-training quantization (PTQ), and the second as quantization-aware training (QAT) [22-24]. In PTQ, a trained model has its weights and activations quantized.…”
Section: Quantization Technique
confidence: 99%
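To make the PTQ/QAT distinction concrete, the sketch below contrasts quantizing a trained weight matrix once (PTQ) with the fake-quantization step applied on every forward pass during QAT. All names are illustrative assumptions; the NumPy code only shows the forward pass, not the training loop.

```python
# PTQ vs. QAT fake quantization (forward pass only).
import numpy as np

def fake_quant(x, n_bits=8):
    """Quantize-then-dequantize: the result stays float32 but takes only
    2**n_bits distinct levels, mimicking low-precision arithmetic."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

w = np.random.randn(32, 32).astype(np.float32)

# PTQ: the model is trained in float32; weights are quantized once after
# training, with no opportunity to adapt to the rounding error.
w_ptq = fake_quant(w)

# QAT: fake_quant is applied in every forward pass during training, so the
# loss "sees" the quantization error. In a real framework the gradient is
# passed through the rounding as if it were the identity (straight-through
# estimator); this sketch only illustrates the forward computation.
def qat_forward(w_float, x):
    return x @ fake_quant(w_float)
```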
“…4. Quantization-aware training [55, 59-68] using QKeras [56, 69] or Brevitas [50, 70], parameter pruning [71-76], and general hardware-algorithm codesign can significantly reduce the necessary FPGA resources by reducing the required bit precision and removing irrelevant operations. 5.…”
Section: Inference Timing
confidence: 99%
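The excerpt names QKeras for quantization-aware training and magnitude pruning as complementary ways to shrink FPGA resource usage. Below is a minimal sketch of that combination using QKeras layers wrapped with the TensorFlow Model Optimization pruning API; the layer sizes, 6-bit precision, and 75% sparsity schedule are illustrative assumptions, not values from the cited papers, and the choice of pruning wrapper is likewise an assumption rather than the cited authors' exact setup.

```python
# Quantization-aware training (QKeras) combined with magnitude pruning
# (tensorflow_model_optimization): a sketch under assumed hyperparameters.
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

def build_model(n_inputs=16, n_outputs=5):
    # Small fully connected network with 6-bit quantized weights/activations.
    inputs = tf.keras.Input(shape=(n_inputs,))
    x = QDense(64,
               kernel_quantizer=quantized_bits(6, 0, alpha=1),
               bias_quantizer=quantized_bits(6, 0))(inputs)
    x = QActivation(quantized_relu(6))(x)
    outputs = QDense(n_outputs,
                     kernel_quantizer=quantized_bits(6, 0, alpha=1),
                     bias_quantizer=quantized_bits(6, 0))(x)
    return tf.keras.Model(inputs, outputs)

# Wrap the quantized model with magnitude pruning that ramps to 75% sparsity.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.75,
    begin_step=0, end_step=10_000)
model = tfmot.sparsity.keras.prune_low_magnitude(
    build_model(), pruning_schedule=pruning_schedule)

model.compile(optimizer="adam", loss="categorical_crossentropy")
# Training requires the UpdatePruningStep callback to advance the schedule:
# model.fit(x_train, y_train,
#           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```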