2021
DOI: 10.48550/arxiv.2103.13630
Preprint
A Survey of Quantization Methods for Efficient Neural Network Inference

Cited by 126 publications (202 citation statements)
References 0 publications
“…It is so complex that there is an IEEE standard for how real numbers should be represented, as well as how arithmetic on this representation should work: IEEE Standard 754, also known as IEEE floating point. Since there are infinitely many real numbers and only finitely many bits that can be allocated for representing each number on machines, we can view representing real numbers on computers as a quantization problem itself, because we are reducing the precision of the reals [14].…”
Section: Representing Numbers On Machines
confidence: 99%
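The point in the excerpt above — that IEEE 754 storage already quantizes the reals — can be seen directly in Python by inspecting the bit pattern of a stored value (a minimal illustration; the variable names are ours):

```python
import struct

# A real number like 1/10 has no exact binary representation, so
# storing it in IEEE 754 rounds ("quantizes") it to the nearest
# representable double.
x = 0.1

# The 64-bit pattern actually stored for x:
bits = struct.unpack("<Q", struct.pack("<d", x))[0]
print(f"{bits:016X}")       # 3FB999999999999A

# The stored value differs slightly from the real number 1/10:
print(f"{x:.20f}")          # 0.10000000000000000555...

# Reducing precision further (float64 -> float32) coarsens the
# quantization grid, increasing the round-off error.
x32 = struct.unpack("<f", struct.pack("<f", x))[0]
print(abs(x32 - x))         # nonzero error introduced by float32
```

The same picture — a finite grid of representable values, with rounding to the nearest grid point — is exactly what low-bit neural network quantization applies to weights and activations.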
“…Enabling neural network inference in resource-constrained settings is important so that NNs can solve problems like speech recognition, autonomous driving, and image classification in IoT devices, vehicles, and more. To realize this, neural network inference must achieve 1) real-time latency, 2) low energy consumption, and 3) high accuracy [14].…”
Section: Introduction
confidence: 99%
“…This hinders the deployment of DNNs to resource-limited applications. Therefore, model compression without significant performance degradation is an important active area of deep learning research [11,25,6,10]. One prominent approach to compression is quantization.…”
Section: Introduction
confidence: 99%
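As a concrete illustration of the quantization approach mentioned above, here is a minimal sketch of symmetric uniform quantization of a weight tensor to signed 8-bit integers — a common baseline technique, not the specific method of any cited work (function names are ours):

```python
import numpy as np

def quantize_uniform(w, num_bits=8):
    """Symmetric uniform quantization: map floats onto a signed
    integer grid, storing one float scale per tensor."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for int8
    scale = np.abs(w).max() / qmax            # largest |w| maps to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer grid."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_uniform(w)
w_hat = dequantize(q, s)
# Round-to-nearest bounds the per-element error by half a grid step.
print(np.abs(w - w_hat).max())
```

Storing `q` (int8) plus one scale in place of `w` (float32) gives roughly a 4x size reduction, which is the compression the excerpt refers to.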
“…In fact, once we get the lower bound of $\mathbb{E}\,\langle X_t, u\rangle^2 / \|X_t\|_2^2$ as in (10), the quantization error for unbounded data (14) can be derived similarly to the proof of Theorem 2.1, albeit using different techniques. It follows from the Cauchy-Schwarz inequality that…”
confidence: 99%
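For reference, the Cauchy-Schwarz step invoked in the truncated excerpt above bounds the inner product term as follows (the symbols mirror the excerpt; the exact forms of (10) and (14) are in the cited work):

```latex
\langle X_t, u \rangle^2 \;\le\; \|X_t\|_2^2 \,\|u\|_2^2
```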