2021
DOI: 10.48550/arxiv.2111.14826
Preprint

Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation

Abstract: The nonuniform quantization strategy for compressing neural networks usually achieves better performance than its counterpart, i.e., uniform strategy, due to its superior representational capacity. However, many nonuniform quantization methods overlook the complicated projection process in implementing the nonuniformly quantized weights/activations, which incurs non-negligible time and space overhead in hardware deployment. In this study, we propose Nonuniform-to-Uniform Quantization (N2UQ), a method that can …
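For readers unfamiliar with the uniform quantization and straight-through estimation that the abstract contrasts against, the sketch below shows a plain uniform quantizer with the standard (non-generalized) straight-through estimator in PyTorch. The function name `uniform_quantize_ste` and the 2-bit setting are illustrative assumptions, not the paper's N2UQ implementation, which additionally learns flexible nonuniform input thresholds while keeping the output levels uniform.

```python
import torch


def uniform_quantize_ste(x, bits=2):
    """Uniformly quantize a tensor in [0, 1] to 2**bits evenly spaced levels.

    Rounding is non-differentiable, so the backward pass uses a plain
    straight-through estimator: gradients pass through as if the rounding
    were the identity function.
    """
    levels = 2 ** bits - 1
    x = torch.clamp(x, 0.0, 1.0)
    q = torch.round(x * levels) / levels
    # Straight-through trick: forward returns q, backward sees only x.
    return x + (q - x).detach()


# Example: quantize a random activation tensor to 2 bits (4 uniform levels).
a = torch.rand(4, requires_grad=True)
q = uniform_quantize_ste(a, bits=2)
q.sum().backward()
print(q, a.grad)  # gradient is all ones, passed straight through the rounding
```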

Cited by 1 publication (1 citation statement). References 42 publications.
“…These methods focus on quantizing most, if not all, network layers to the same uniform bit-width. While this has been shown to be effective for recovering full-precision accuracy at higher bit-widths, using extremely low precision still leads to significant accuracy degradation (Courbariaux et al., 2015; Esser et al., 2015; Rastegari et al., 2016; Zhou et al., 2016; McKinstry et al., 2019; Esser et al., 2020; Liu et al., 2021c). To further push the envelope of maximizing throughput and minimizing memory footprint while maintaining task performance, mixed-precision quantization methods have emerged with the goal of optimizing the bit-width of each layer independently to maximize overall network performance (Dong et al., 2019; Yao et al., 2021; Chen et al., 2021).…”
Section: Introduction
confidence: 99%
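To make the contrast between single uniform bit-width and mixed-precision quantization in the excerpt concrete, the following is a minimal sketch that assigns a separate bit-width to each layer and quantizes its weights uniformly. The per-layer bit-width map, layer names, and helper `quantize_weights` are hypothetical illustrations, not the procedure of any of the cited methods.

```python
import torch
import torch.nn as nn


def quantize_weights(w, bits):
    """Symmetric uniform quantization of a weight tensor to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


# Hypothetical per-layer bit-width assignment, e.g. as produced by a
# mixed-precision search; layers omitted from the map stay full precision.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
layer_bits = {"0": 8, "2": 4}  # first Linear at 8 bits, last Linear at 4 bits

with torch.no_grad():
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name in layer_bits:
            module.weight.copy_(quantize_weights(module.weight, layer_bits[name]))
```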