Post-training Piecewise Linear Quantization for Deep Neural Networks

Fang, Jun; Shafiee, Ali; Abdel-Aziz, Hamzah; Thorsley, David; Georgiadis, Georgios; Hassoun, Joseph

doi:10.1007/978-3-030-58536-5_5

Cited by 90 publications

(56 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Relying on an abundance of the previous conclusions about quantization for traditional network solutions [ 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 ], further improvements in the field of NNs, especially in NNs intended for edge devices, can be intuitively driven by the prudent application of post-training quantization. Post-training quantization is especially convenient as there is no need for retraining NN, while the memory size required for storing the weights of the quantized neural network (QNN) model can be significantly reduced compared to the baseline NN model utilizing 32-bit floating-point (FP32) format [ 6 , 14 , 15 , 19 , 33 ].…”

Section: Introductionmentioning

confidence: 99%

“…Namely, an important challenge in post-training quantization is that it can lead to significant performance degradation, especially in ultra-low precision settings. To cope with this, inspired by the conclusions from classical quantization, numerous papers have addressed the problem of minimizing the inevitable post-training quantization error (see, for instance, [ 6 , 12 , 15 , 33 ]).…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Whether the Support Region of Three-Bit Uniform Quantizer Has a Strong Impact on Post-Training Quantization for MNIST Dataset?

Nikolić

Perić

Aleksić

et al. 2021

Entropy

View full text Add to dashboard Cite

Driven by the need for the compression of weights in neural networks (NNs), which is especially beneficial for edge devices with a constrained resource, and by the need to utilize the simplest possible quantization model, in this paper, we study the performance of three-bit post-training uniform quantization. The goal is to put various choices of the key parameter of the quantizer in question (support region threshold) in one place and provide a detailed overview of this choice’s impact on the performance of post-training quantization for the MNIST dataset. Specifically, we analyze whether it is possible to preserve the accuracy of the two NN models (MLP and CNN) to a great extent with the very simple three-bit uniform quantizer, regardless of the choice of the key parameter. Moreover, our goal is to answer the question of whether it is of the utmost importance in post-training three-bit uniform quantization, as it is in quantization, to determine the optimal support region threshold value of the quantizer to achieve some predefined accuracy of the quantized neural network (QNN). The results show that the choice of the support region threshold value of the three-bit uniform quantizer does not have such a strong impact on the accuracy of the QNNs, which is not the case with two-bit uniform post-training quantization, when applied in MLP for the same classification task. Accordingly, one can anticipate that due to this special property, the post-training quantization model in question can be greatly exploited.

show abstract

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Whether the Support Region of Three-Bit Uniform Quantizer Has a Strong Impact on Post-Training Quantization for MNIST Dataset?

Nikolić

Perić

Aleksić

et al. 2021

Entropy

View full text Add to dashboard Cite

show abstract

“…It is worth noticing, however, in recent years, various quantization solution have been proposed, specifically, posttraining static quantization [21]- [23] that does not involve retraining of the neural network. In the work by Fang et al [21], the authors proposed to split the weight distribution into multiple regions. For each region, the weights are then quantized with their respective scaling factors to convert to their respective integer ranges.…”

Section: B Motivation: Why Another Level Of Quantization?mentioning

confidence: 99%

“…However, owing to the requirement of the proposed scheme, each region of the weight distribution represents a separate computation path, due to the differences in their scaling factors. This requirement, as commented by the authors [21], specified that at least three or more accumulators are required based on the number of regions the weight distribution is split into. The number of accumulators (at least three) tied to each multiply and accumulate (MAC) processing element (PE) might require more hardware resource for implementation, not to mention that existing CNN accelerators usually implements large number of PE in parallel to achieve high performance computation.…”

Section: B Motivation: Why Another Level Of Quantization?mentioning

confidence: 99%

DoubleQExt: Hardware and Memory Efficient CNN Through Two Levels of Quantization

See

Ng²,

Tan

et al. 2021

IEEE Access

View full text Add to dashboard Cite

To fulfil the tight area and memory constraints in IoT applications, the design of efficient Convolutional Neural Network (CNN) hardware becomes crucial. Quantization of CNN is one of the promising approach that allows the compression of large CNN into a much smaller one, which is very suitable for IoT applications. Among various proposed quantization schemes, Power-of-two (PoT) quantization enables efficient hardware implementation and small memory consumption for CNN accelerators, but requires retraining of CNN to retain its accuracy. This paper proposes a two-level post-training static quantization technique (DoubleQ) that combines the 8-bit and PoT weight quantization. The CNN weight is first quantized to 8-bit (level one), then further quantized to PoT (level two). This allows multiplication to be carried out using shifters, by expressing the weights in their PoT exponent form. DoubleQ also reduces the memory storage requirement for CNN, as only the exponent of the weights is needed for storage. However, DoubleQ trades the accuracy of the network for reduced memory storage. To recover the accuracy, a selection process (DoubleQExt) was proposed to strategically select some of the less informative layers in the network to be quantized with PoT at the second level. On ResNet-20, the proposed DoubleQ can reduce the memory consumption by 37.50% with 7.28% accuracy degradation compared to 8-bit quantization. By applying DoubleQExt, the accuracy is only degraded by 1.19% compared to 8-bit version while achieving a memory reduction of 23.05%. This result is also 1% more accurate than the state-of-the-art work (SegLog). The proposed DoubleQExt also allows flexible configuration to trade off the memory consumption with better accuracy, which is not found in the other state-of-the-art works. With the proposed two-level weight quantization, one can achieve a more efficient hardware architecture for CNN with minimal impact to the accuracy, which is crucial for IoT applications.

show abstract

“…[22]. Post-training quantization methods [47][48][49][50] avoid these limitations by searching for the optimal tensor-cutting values to reduce quantization noise after the network model has been trained.…”

Section: Prior Workmentioning

confidence: 99%

Compression of Neural Networks for Specialized Tasks via Value Locality

Gabbay

Shomron

2021

Mathematics

View full text Add to dashboard Cite

Convolutional Neural Networks (CNNs) are broadly used in numerous applications such as computer vision and image classification. Although CNN models deliver state-of-the-art accuracy, they require heavy computational resources that are not always affordable or available on every platform. Limited performance, system cost, and energy consumption, such as in edge devices, argue for the optimization of computations in neural networks. Toward this end, we propose herein the value-locality-based compression (VELCRO) algorithm for neural networks. VELCRO is a method to compress general-purpose neural networks that are deployed for a small subset of focused specialized tasks. Although this study focuses on CNNs, VELCRO can be used to compress any deep neural network. VELCRO relies on the property of value locality, which suggests that activation functions exhibit values in proximity through the inference process when the network is used for specialized tasks. VELCRO consists of two stages: a preprocessing stage that identifies output elements of the activation function with a high degree of value locality, and a compression stage that replaces these elements with their corresponding average arithmetic values. As a result, VELCRO not only saves the computation of the replaced activations but also avoids processing their corresponding output feature map elements. Unlike common neural network compression algorithms, which require computationally intensive training processes, VELCRO introduces significantly fewer computational requirements. An analysis of our experiments indicates that, when CNNs are used for specialized tasks, they introduce a high degree of value locality relative to the general-purpose case. In addition, the experimental results show that without any training process, VELCRO produces a compression-saving ratio in the range 13.5–30.0% with no degradation in accuracy. Finally, the experimental results indicate that, when VELCRO is used with a relatively low compression target, it significantly improves the accuracy by 2–20% for specialized CNN tasks.

show abstract

Post-training Piecewise Linear Quantization for Deep Neural Networks

Cited by 90 publications

References 36 publications

Whether the Support Region of Three-Bit Uniform Quantizer Has a Strong Impact on Post-Training Quantization for MNIST Dataset?

Whether the Support Region of Three-Bit Uniform Quantizer Has a Strong Impact on Post-Training Quantization for MNIST Dataset?

DoubleQExt: Hardware and Memory Efficient CNN Through Two Levels of Quantization

Compression of Neural Networks for Specialized Tasks via Value Locality

Contact Info

Product

Resources

About