2021
DOI: 10.1109/tnnls.2020.3008996
Exploiting Retraining-Based Mixed-Precision Quantization for Low-Cost DNN Accelerator Design

Cited by 18 publications (11 citation statements) · References 12 publications
“…To avoid performance degradation of the quantized model, QAT was first proposed to retrain the quantized model [15]-[18]. Using the full training dataset, QAT performs floating-point forward and backward propagation on the DNN model and quantizes it to low bit-width after each training epoch.…”
Section: A. Quantization Aware Training (QAT) | mentioning
confidence: 99%
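The excerpt describes QAT as full-precision retraining followed by per-epoch quantization. Below is a minimal Python/PyTorch sketch of that idea, assuming a uniform symmetric quantizer; the function names and the scale choice are illustrative, not taken from the cited paper.

```python
import torch

def quantize_tensor(x, num_bits=8):
    # Uniform symmetric quantization to a low-bit grid (one common choice, not the paper's exact scheme).
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax + 1e-12
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

def qat_retrain(model, loader, optimizer, loss_fn, epochs, num_bits=8):
    for _ in range(epochs):
        # Floating-point forward and backward passes over the full training set...
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(inputs), targets).backward()
            optimizer.step()
        # ...then quantize the weights to low bit-width after each epoch, as the excerpt describes.
        with torch.no_grad():
            for p in model.parameters():
                p.copy_(quantize_tensor(p, num_bits))
```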
“…In particular, PACT [15] optimizes the clipping ranges of activations during model retraining. LSQ [17] learns the quantization step size as a model parameter, and MPQ [18] exploits retraining-based mixed-precision quantization. However, the high computational complexity of QAT restricts its implementation in practice.…”
Section: A. Quantization Aware Training (QAT) | mentioning
confidence: 99%
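To make the "learned step size" idea concrete, here is a rough sketch of an LSQ-style quantizer in which the step size is a trainable parameter and rounding is handled with a straight-through estimator. It omits LSQ's gradient-scaling term, and the class and parameter names are assumptions for illustration, not the paper's API.

```python
import torch
import torch.nn as nn

class LSQStyleQuantizer(nn.Module):
    """Learned-step-size quantizer sketch: the step size is optimized jointly with the model."""
    def __init__(self, num_bits=4, init_step=0.1):
        super().__init__()
        self.qmin = -(2 ** (num_bits - 1))
        self.qmax = 2 ** (num_bits - 1) - 1
        self.step = nn.Parameter(torch.tensor(float(init_step)))  # trainable step size

    def forward(self, x):
        q = torch.clamp(x / self.step, self.qmin, self.qmax)
        # Straight-through estimator: forward uses round(q), backward treats rounding as identity.
        q_rounded = q + (torch.round(q) - q).detach()
        return q_rounded * self.step
```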
“…Quantization [8]-[26], as the name implies, represents the weights and activations used in the network's forward-propagation computation, and the 32-bit or 64-bit floating-point gradient values used in back-propagation, with low-bit floating-point or fixed-point numbers, and can even compute on them directly. Figure 3 shows the basic idea of converting floating-point numbers into signed 8-bit fixed-point numbers.…”
Section: Model Quantization | mentioning
confidence: 99%
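A minimal NumPy sketch of the float-to-signed-8-bit fixed-point conversion that the excerpt's Figure 3 illustrates: values are scaled by a power of two, rounded, and clipped to the int8 range. The rule for picking the fractional bit-width is one common choice, assumed here for illustration rather than taken from the cited figure.

```python
import numpy as np

def float_to_int8_fixed_point(x, frac_bits=None):
    # Convert float values to signed 8-bit fixed point (int8 codes plus an implied scale 2**-frac_bits).
    x = np.asarray(x, dtype=np.float32)
    if frac_bits is None:
        # Pick the largest fractional bit-width that still covers the data range.
        max_abs = max(float(np.abs(x).max()), 1e-12)
        frac_bits = int(np.floor(np.log2(127.0 / max_abs)))
    codes = np.clip(np.round(x * (2.0 ** frac_bits)), -128, 127).astype(np.int8)
    return codes, frac_bits

# Usage: quantize a few floats and recover their approximate values.
codes, fb = float_to_int8_fixed_point([0.75, -1.5, 0.031])
recovered = codes.astype(np.float32) / (2.0 ** fb)
```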
“…Model quantization [8]-[26], as a means of compressing models, can be applied at model deployment so that both the model size and the inference latency are reduced. At present, SR models are becoming larger and larger.…”
Section: Introduction | mentioning
confidence: 99%