2022 | Preprint
DOI: 10.48550/arxiv.2202.05239

F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization

Abstract: Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce it, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization. This introduces a noticeable cost in terms of memory, speed, and required energy. To tackle these issues, we prese…
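The truncated abstract contrasts fixed-point 8-bit only multiplication with approaches that need INT32 or full-precision multiplication for scaling or dequantization. As a rough illustration of the underlying idea (a minimal sketch, not the paper's method; the function names and the choice of 5 fractional bits are assumptions made here), a power-of-two fixed-point scale lets the rescaling step become an integer shift instead of a high-precision multiply:

```python
import numpy as np

def quantize_fixed_point(x: np.ndarray, frac_bits: int) -> np.ndarray:
    """Signed 8-bit fixed point with `frac_bits` fractional bits:
    each value is stored as an int8 q representing q / 2**frac_bits."""
    q = np.round(x * (1 << frac_bits))           # scale by 2**frac_bits and round
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_fixed_point(q: np.ndarray, frac_bits: int) -> np.ndarray:
    """Scale back by 2**frac_bits; with a power-of-two scale this is a
    bit shift in integer hardware rather than an INT32/float multiply."""
    return q.astype(np.float32) / (1 << frac_bits)

x = (0.5 * np.random.randn(4)).astype(np.float32)
q = quantize_fixed_point(x, frac_bits=5)
print(q, dequantize_fixed_point(q, frac_bits=5))
```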

Cited by 6 publications (7 citation statements)
References 45 publications (55 reference statements)
“…This poses a significant challenge for mobile devices in terms of computation and resource requirements. Our future work will enhance SoD 2 by combining it with the model pruning and quantization advances [27,47,64] to achieve an even better performance. Extending beyond ONNX.…”
Section: Discussion and Future Work
confidence: 99%
“…LogNN [24] and ShiftAddNet [37] do not conduct experiments on large-scale datasets such as ImageNet. S2FP8 [6] and LUQ [8] introduce extra multiplications in the quantization process, which increase the energy consumption as stated in [18].…”
Section: Methods
confidence: 99%
“…These data types have large errors for large magnitude values since they have only 2 bits for the fraction but provide high accuracy for small magnitude values. Jin et al (2022) provide an excellent analysis of when certain fixed point exponent/fraction bit widths are optimal for inputs with a particular standard deviation. We believe FP8 data types offer superior performance compared to the Int8 data type, but currently, neither GPUs nor TPUs support this data type.…”
Section: Related Work
confidence: 99%
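The statement above refers to the analysis of which fixed-point integer/fraction bit split works best for inputs of a given standard deviation. A hedged, self-contained illustration of that trade-off (Gaussian inputs and an RMSE criterion are assumptions made here, not details taken from either paper): fewer fractional bits clip less but round more coarsely, so the best split shifts with the input scale.

```python
import numpy as np

def fixed_point_rmse(x: np.ndarray, frac_bits: int) -> float:
    """RMSE of signed 8-bit fixed-point quantization with `frac_bits`
    fractional bits (stored value q in [-128, 127] represents q / 2**frac_bits)."""
    q = np.clip(np.round(x * (1 << frac_bits)), -128, 127)
    return float(np.sqrt(np.mean((q / (1 << frac_bits) - x) ** 2)))

rng = np.random.default_rng(0)
for std in (0.1, 1.0, 10.0):
    x = rng.normal(0.0, std, size=100_000)
    best = min(range(8), key=lambda f: fixed_point_rmse(x, f))
    print(f"std={std:>4}: lowest-error fractional bit width = {best}")
```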