2021
DOI: 10.1016/j.patcog.2020.107647
Mixed-precision quantized neural networks with progressively decreasing bitwidth

Cited by 25 publications (11 citation statements); references 4 publications.
“…Chu et al. [87] heuristically assign the word length of the activations and weights of each layer based on the separability of their hierarchical distributions. However, for large datasets such as ImageNet, it is unaffordable to obtain a complete separability matrix.…”
Section: Mixed-precision Quantization (mentioning)
confidence: 99%
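
The heuristic described in the statement above (assigning per-layer word lengths from the class separability of feature distributions) can be illustrated with a small sketch: score each layer's features with a Fisher-style ratio of between-class to within-class scatter, and give fewer bits to layers whose features are more separable. The scoring function, the candidate bitwidths, and the rank-based mapping below are illustrative assumptions, not the exact procedure of Chu et al.

```python
import numpy as np

def class_separability(features, labels):
    """Fisher-style separability score for one layer: between-class
    scatter divided by within-class scatter of the (flattened) features.
    An illustrative proxy, not the exact criterion of the cited paper."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    between, within = 0.0, 0.0
    for c in classes:
        fc = features[labels == c]
        mc = fc.mean(axis=0)
        between += len(fc) * np.sum((mc - overall_mean) ** 2)
        within += np.sum((fc - mc) ** 2)
    return between / (within + 1e-12)

def assign_bitwidths(layer_scores, candidate_bits=(8, 6, 4, 2)):
    """Heuristically map separability scores to bitwidths:
    more separable layers are assigned fewer bits."""
    layer_scores = np.asarray(layer_scores)
    ranks = np.argsort(np.argsort(layer_scores))  # 0 = least separable layer
    bins = np.linspace(0, len(layer_scores), len(candidate_bits) + 1)
    bits = []
    for r in ranks:
        idx = min(np.searchsorted(bins, r, side="right") - 1,
                  len(candidate_bits) - 1)
        bits.append(candidate_bits[idx])
    return bits
```

In practice the scores would be computed from features extracted on a held-out set, which is exactly where the quoted concern about large datasets such as ImageNet arises.
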
“…Another notable drawback of the techniques discussed in [58], [72], [77], [83]-[87] is that they usually perform training repeatedly, which is highly inefficient and takes a long time to construct the quantized model [39]. Furthermore, training requires a full-size dataset, which is often unavailable in real-world scenarios for reasons such as proprietary restrictions and privacy, especially when working with an off-the-shelf pre-trained model from the community or industry for which the data is no longer accessible.…”
Section: Mixed-precision Quantization (mentioning)
confidence: 99%
“…Quantization [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26], as the name implies, represents the weights and activations of the forward pass of a neural network, as well as the 32-bit or 64-bit floating-point gradient values of the backward pass, with low-bit floating-point or fixed-point numbers, which can even be used directly in computation. Figure 3 shows the basic idea of converting floating-point numbers into signed 8-bit fixed-point numbers.…”
Section: Model Quantization (mentioning)
confidence: 99%
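
The float-to-fixed-point conversion mentioned in connection with Figure 3 (the figure itself is not reproduced here) can be sketched as symmetric uniform quantization to signed 8-bit integers. The per-tensor scale and the clipping range below are common conventions assumed for illustration, not necessarily the scheme used in the cited work.

```python
import numpy as np

def quantize_int8(x, scale=None):
    """Symmetric uniform quantization of a float tensor to signed 8-bit
    integers: q = round(x / scale), clipped to [-127, 127]."""
    if scale is None:
        scale = float(np.max(np.abs(x))) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Example: store weights as int8, recover an approximation at inference time.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```
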
“…Model quantization [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26], as a means of compressing models, can be applied at deployment time so that both the model size and the inference latency are reduced. At present, SR models are becoming larger and larger.…”
Section: Introduction (mentioning)
confidence: 99%
“…Another case is to improve the performance of the low-precision model so that it comes closer to the 32-bit floating-point model. For instance, Chu et al. [21] proposed a quantization method that progressively reduces the bit-width from the input layer to the last layer. The method builds on the observation that feature distributions in the shallow layers exhibit low class separability, whereas the distributions in the deeper layers exhibit high class separability.…”
Section: Introduction (mentioning)
confidence: 99%
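
A minimal sketch of the progressively decreasing bitwidth idea quoted above, assuming a simple linear schedule from a first-layer to a last-layer bitwidth; the endpoints and the rounding are illustrative choices, not the exact assignment used by Chu et al. [21].

```python
import numpy as np

def progressive_bitwidths(num_layers, first_bits=8, last_bits=2):
    """Assign a monotonically decreasing bitwidth from the first layer to
    the last, reflecting the quoted observation that deeper features are
    more class-separable and therefore tolerate coarser quantization."""
    schedule = np.linspace(first_bits, last_bits, num_layers)
    return [int(round(b)) for b in schedule]

# e.g. a 10-layer network: [8, 7, 7, 6, 5, 5, 4, 3, 3, 2]
print(progressive_bitwidths(10))
```
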