2021 | Preprint
DOI: 10.48550/arxiv.2112.06126

Neural Network Quantization for Efficient Inference: A Survey

Abstract: As neural networks have become more powerful, there has been a rising desire to deploy them in the real world; however, the power and accuracy of neural networks are largely due to their depth and complexity, making them difficult to deploy, especially in resource-constrained devices. Neural network quantization has recently arisen to meet this demand of reducing the size and complexity of neural networks by reducing the precision of a network. With smaller and simpler networks, it becomes possible to run neura…

Citations: Cited by 6 publications (8 citation statements)
References: 25 publications
“…Quantization has demonstrated excellent and consistent results when used during training and inference with different NN models [39], [75]-[77]. It is particularly effective during inference because it saves computing resources without significantly decreasing accuracy.…”
Section: Quantization (mentioning; confidence: 98%)
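
To make the resource saving concrete, below is a minimal sketch of symmetric per-tensor int8 quantization of a float32 weight tensor; it is illustrative only (the shape and values are assumed, not taken from the cited works) and shows the 4x memory reduction and the small reconstruction error that reduced precision introduces.

import numpy as np

# Minimal sketch: symmetric per-tensor int8 quantization of float32 weights.
# The tensor shape and values are illustrative, not taken from the cited papers.
w = np.random.randn(256, 256).astype(np.float32)

scale = np.abs(w).max() / 127.0                      # map the largest magnitude to 127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale

print("memory reduction:", w.nbytes / w_int8.nbytes)            # 4.0x (float32 -> int8)
print("mean abs error:", float(np.abs(w - w_dequant).mean()))    # small relative to |w|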
“…Neural network quantization facilitates the efficient deployment of deep neural networks (DNNs) on resource-constrained platforms, such as drones and Internet-of-Things (IoT) devices [1], [2]. During inference, MAC operations consisting of multiplications and accumulations dominate the arithmetic cost.…”
Section: Introduction (mentioning; confidence: 99%)
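
Since the quoted statement centers on MAC cost, here is a hedged sketch (shapes and scale values are assumptions for illustration) of how a quantized multiply-accumulate typically runs: int8 operands, a wider int32 accumulator, and a single rescale at the end.

import numpy as np

# Sketch of a quantized MAC path: int8 operands, int32 accumulation, one rescale.
# Vector length and per-tensor scales are illustrative assumptions.
x_int8 = np.random.randint(-127, 128, size=64, dtype=np.int8)
w_int8 = np.random.randint(-127, 128, size=64, dtype=np.int8)
x_scale, w_scale = 0.05, 0.02       # per-tensor scales obtained during quantization

# Multiply-accumulate in int32 so the summed int8 products cannot overflow.
acc_int32 = np.sum(x_int8.astype(np.int32) * w_int8.astype(np.int32))

# A single floating-point rescale recovers an approximation of the real-valued dot product.
y = float(acc_int32) * (x_scale * w_scale)
print(y)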
“…As their names suggest, PTQ quantizes model parameters after NN training, while QAT quantizes them during NN training. PTQ is suitable for applications where speed and simplicity of quantization are preferred over extreme bit-precision reduction, while QAT should be chosen for low-precision quantization with negligible accuracy degradation [5].…”
Section: Introduction (mentioning; confidence: 99%)
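
A minimal PTQ sketch, assuming a generic asymmetric uniform quantizer and placeholder "pre-trained" weights and calibration data (none of the names below come from the cited papers): ranges are estimated from the tensors themselves and no retraining is performed.

import numpy as np

def quantize_tensor(t, num_bits=8):
    """Asymmetric uniform quantization: return quantized tensor, scale, zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    t_min, t_max = float(t.min()), float(t.max())
    scale = (t_max - t_min) / (qmax - qmin) or 1.0            # avoid a zero scale
    zero_point = int(round(qmin - t_min / scale))
    q = np.clip(np.round(t / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

# PTQ: take an already-trained weight matrix and a small calibration batch
# (both random placeholders here), derive ranges, and quantize with no retraining.
w_fp32 = np.random.randn(128, 64).astype(np.float32)     # stand-in for pre-trained weights
calib_x = np.random.randn(32, 128).astype(np.float32)    # stand-in for calibration data

w_q, w_scale, w_zp = quantize_tensor(w_fp32)
x_q, x_scale, x_zp = quantize_tensor(calib_x)             # activation range from calibration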
“…QAT usually results in lower accuracy loss than PTQ. However, training a quantized model from scratch increases the quantization time [9]. In PTQ, one can choose any pre-trained model and perform quantization much faster; however, getting good performance in low-bit precision is difficult [9].…”
Section: Introduction (mentioning; confidence: 99%)
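
To illustrate why QAT costs extra training time but tolerates lower precision, here is a sketch of one common QAT mechanism, fake quantization with a straight-through estimator; the loss, data, and learning rate are illustrative placeholders, not details from the cited survey.

import numpy as np

def fake_quant(w, num_bits=8):
    """Quantize-then-dequantize, so the forward pass sees the quantization error."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.abs(w).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# QAT training-step sketch: the forward pass uses fake-quantized weights, while the
# gradient update is applied to the full-precision copy (straight-through estimator).
w = np.random.randn(10).astype(np.float32)    # full-precision master weights
x = np.random.randn(10).astype(np.float32)    # placeholder input
target, lr = 1.0, 0.01

for _ in range(100):
    y = float(np.dot(fake_quant(w), x))       # forward pass with quantized weights
    grad_y = 2.0 * (y - target)               # d/dy of the squared error (y - target)**2
    grad_w = grad_y * x                       # STE: treat fake_quant as identity in backward
    w -= lr * grad_w                          # update the full-precision weights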