2023
DOI: 10.48550/arxiv.2303.05016
Preprint

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version

Abstract: Quantization is a popular technique used in Deep Neural Network (DNN) inference to reduce model size and improve overall numerical performance by exploiting native hardware. This paper conducts an elaborate performance characterization of the benefits of quantization techniques, mainly FP16/INT8 variants with static and dynamic schemes, using the MLPerf Edge Inference benchmarking methodology. The study is conducted on Intel x86 processors and a Raspberry Pi device with an ARM processor. …
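The static and dynamic schemes the abstract mentions can be illustrated with TensorFlow Lite's post-training quantization, one of the toolchains the paper benchmarks. Below is a minimal sketch; the MobileNetV2 placeholder model and the random calibration data are illustrative assumptions, not the paper's actual workloads.

```python
# Minimal sketch of post-training quantization with TensorFlow Lite.
# The model and calibration data are placeholders (assumptions),
# not the workloads benchmarked in the paper.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)  # placeholder model

# Dynamic scheme: weights are quantized to INT8 ahead of time;
# activation ranges are computed on the fly at inference time.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_int8 = converter.convert()

# Static scheme: activation ranges are calibrated offline with a
# representative dataset, producing a fully integer INT8 model.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
static_int8 = converter.convert()

# FP16 variant: halves model size while staying in floating point.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16 = converter.convert()
```

The dynamic scheme needs no calibration data but quantizes activations at runtime, while the static scheme trades a one-time calibration pass for a fully integer execution path on supported hardware.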

Cited by 1 publication (2 citation statements)
References 8 publications
“…Typically, lightweight networks feature broader numerical distribution ranges and fewer weights. The former leads to larger parameter quantization errors, exacerbating the discrepancy between the optimal solutions of Equations (7) and (8). The latter diminishes the efficacy of adaptive rounding, analogous to a consensus that fewer neural network parameters result in weaker fitting optimization capability.…”
Section: Comprehensive Comparison
confidence: 99%
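The link this statement draws between a broader numerical range and larger quantization error can be made concrete with a small numerical experiment. This is a sketch assuming uniform symmetric per-tensor INT8 quantization, the common textbook scheme; the equations and networks the statement refers to are not reproduced here, and the weight distributions are synthetic.

```python
# Sketch: wider weight distributions force a larger quantization step,
# which increases round-off error. Assumes uniform symmetric
# per-tensor INT8 quantization; the distributions are synthetic.
import numpy as np

def int8_roundtrip(w):
    scale = np.abs(w).max() / 127.0          # step size grows with the range
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale, scale

rng = np.random.default_rng(0)
for name, sigma in [("narrow", 0.1), ("wide", 1.0)]:
    w = rng.normal(0.0, sigma, size=100_000)
    w_hat, scale = int8_roundtrip(w)
    mse = np.mean((w - w_hat) ** 2)
    print(f"{name:6s}  step={scale:.5f}  MSE={mse:.2e}")
```

The wider distribution yields a roughly 10× larger step size and a correspondingly larger mean-squared quantization error, matching the statement's first point.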
“…This reduction in data bit-width directly decreases power consumption and storage requirements and improves computational speed. For example, INT8-based quantized models deliver 3.3× and 4× better performance over FP32 using OpenVINO on an Intel CPU and TFLite on a Raspberry Pi device, respectively, for the MLPerf offline scenario [8]. Therefore, quantization is an exceptionally effective technique for model compression and acceleration.…”
Section: Introduction
confidence: 99%
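The offline-scenario speedups cited here are throughput measurements. A rough way to obtain such a number on a Raspberry Pi is to time repeated invocations of a quantized .tflite model with the TFLite interpreter; the sketch below assumes a hypothetical model file name and uses zero-filled inputs, so it measures speed only, not accuracy.

```python
# Rough throughput timing of a quantized TFLite model, loosely in the
# spirit of an MLPerf-style offline run. "model_int8.tflite" is a
# hypothetical file name; inputs are zero-filled, so only speed is measured.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.zeros(inp["shape"], dtype=inp["dtype"])
runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
elapsed = time.perf_counter() - start
print(f"{runs / elapsed:.1f} inferences/s")
```

Running the same loop against an FP32 export of the same model and taking the ratio of the two throughputs gives the kind of speedup figure the statement cites.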