2023
DOI: 10.48550/arxiv.2303.05016
Preprint

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version

Abstract: Quantization is a popular technique used in Deep Neural Network (DNN) inference to reduce model size and improve overall numerical performance by exploiting native hardware. This paper conducts an elaborate performance characterization of the benefits of quantization techniques, mainly FP16/INT8 variants with static and dynamic schemes, using the MLPerf Edge Inference benchmarking methodology. The study is conducted on Intel x86 processors and a Raspberry Pi device with an ARM processor. …
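The static and dynamic schemes the abstract mentions can be illustrated with TensorFlow Lite's post-training quantization, one of the toolchains the paper benchmarks. Below is a minimal sketch; the MobileNetV2 placeholder model and the random calibration data are illustrative assumptions, not the paper's actual workloads.

```python
# Minimal sketch of post-training quantization with TensorFlow Lite.
# The model and calibration data are placeholders (assumptions),
# not the workloads benchmarked in the paper.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)  # placeholder model

# Dynamic scheme: weights are quantized to INT8 ahead of time;
# activation ranges are computed on the fly at inference time.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_int8 = converter.convert()

# Static scheme: activation ranges are calibrated offline with a
# representative dataset, producing a fully integer INT8 model.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
static_int8 = converter.convert()

# FP16 variant: halves model size while staying in floating point.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16 = converter.convert()
```

The dynamic scheme needs no calibration data but quantizes activations at runtime, while the static scheme trades a one-time calibration pass for a fully integer execution path on supported hardware.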

Cited by 1 publication (2 citation statements)
References 8 publications
“…Typically, lightweight networks feature broader numerical distribution ranges and fewer weights. The former leads to larger parameter quantization errors, exacerbating the discrepancy between the optimal solutions of Equations (7) and (8). The latter diminishes the efficacy of adaptive rounding, analogous to a consensus that fewer neural network parameters result in weaker fitting optimization capability.…”
Section: Comprehensive Comparison
confidence: 99%
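The link this statement draws between a broader numerical range and larger quantization error can be made concrete with a small numerical experiment. This is a sketch assuming uniform symmetric per-tensor INT8 quantization, the common textbook scheme; the equations and networks the statement refers to are not reproduced here, and the weight distributions are synthetic.

```python
# Sketch: wider weight distributions force a larger quantization step,
# which increases round-off error. Assumes uniform symmetric
# per-tensor INT8 quantization; the distributions are synthetic.
import numpy as np

def int8_roundtrip(w):
    scale = np.abs(w).max() / 127.0          # step size grows with the range
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale, scale

rng = np.random.default_rng(0)
for name, sigma in [("narrow", 0.1), ("wide", 1.0)]:
    w = rng.normal(0.0, sigma, size=100_000)
    w_hat, scale = int8_roundtrip(w)
    mse = np.mean((w - w_hat) ** 2)
    print(f"{name:6s}  step={scale:.5f}  MSE={mse:.2e}")
```

The wider distribution yields a roughly 10× larger step size and a correspondingly larger mean-squared quantization error, matching the statement's first point.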
“…This reduction in data bit-width directly decreases power consumption and storage requirements and improves computational speed. For example, INT8-based quantized models deliver 3.3× and 4× better performance over FP32 using OpenVINO on an Intel CPU and TFLite on a Raspberry Pi device, respectively, for the MLPerf offline scenario [8]. Therefore, quantization is an exceptionally effective technique for model compression and acceleration.…”
Section: Introduction
confidence: 99%
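The offline-scenario speedups cited here are throughput measurements. A rough way to obtain such a number on a Raspberry Pi is to time repeated invocations of a quantized .tflite model with the TFLite interpreter; the sketch below assumes a hypothetical model file name and uses zero-filled inputs, so it measures speed only, not accuracy.

```python
# Rough throughput timing of a quantized TFLite model, loosely in the
# spirit of an MLPerf-style offline run. "model_int8.tflite" is a
# hypothetical file name; inputs are zero-filled, so only speed is measured.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.zeros(inp["shape"], dtype=inp["dtype"])
runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
elapsed = time.perf_counter() - start
print(f"{runs / elapsed:.1f} inferences/s")
```

Running the same loop against an FP32 export of the same model and taking the ratio of the two throughputs gives the kind of speedup figure the statement cites.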