Abstract: Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce it, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization. This introduces a noticeable cost in terms of memory, speed, and required energy. To tackle these issues, we present…
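The scaling/dequantization cost mentioned in this abstract can be made concrete with a minimal NumPy sketch of standard affine quantization (an illustrative example only, not the paper's proposed method; the per-tensor scaling scheme and variable names are assumptions):

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    """Affine quantization: map float32 values to int8 codes."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8(q, scale, zero_point):
    """Dequantization needs a full-precision multiply by `scale`;
    this rescaling step is the cost the abstract refers to."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4).astype(np.float32)
scale = (x.max() - x.min()) / 255.0            # per-tensor scale (assumed scheme)
zero_point = np.round(-x.min() / scale) - 128  # map x.min() to -128
q = quantize_int8(x, scale, zero_point)
x_hat = dequantize_int8(q, scale, zero_point)
print(q, x_hat)
```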
“…This poses a significant challenge for mobile devices in terms of computation and resource requirements. Our future work will enhance SoD² by combining it with the model pruning and quantization advances [27,47,64] to achieve even better performance. Extending beyond ONNX.…”
Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes, and even the set of operators used, are dependent upon the input and/or execution, are becoming common. This paper presents SoD², a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification towards a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains. Our evaluation results show that SoD² runs up to 3.9× faster than these systems while saving up to 88% peak memory consumption.
CCS CONCEPTS: • Computing methodologies → Neural networks; • Software and its engineering → Source code generation; • Human-centered computing → Mobile computing.
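The idea of statically propagating shapes as known constants or symbolic constants can be sketched as follows. This is a simplified illustration under assumed operator rules, not SoD²'s actual RDP implementation; the function names are hypothetical:

```python
import sympy as sp

# Dimensions are either known integers or symbolic constants (e.g. sequence length n).
n = sp.Symbol("n", positive=True, integer=True)

def matmul_shape(a_shape, b_shape):
    """Propagate the output shape of A @ B; dims may be ints or symbols."""
    *batch, m, k1 = a_shape
    k2, p = b_shape[-2], b_shape[-1]
    assert sp.simplify(k1 - k2) == 0, "inner dimensions must agree"
    return (*batch, m, p)

def concat_shape(a_shape, b_shape, axis):
    """Concatenation adds the dimensions along `axis`, possibly symbolically."""
    out = list(a_shape)
    out[axis] = sp.simplify(a_shape[axis] + b_shape[axis])
    return tuple(out)

# A dynamic sequence of symbolic length n flows through two operators:
h = matmul_shape((1, n, 768), (768, 3072))   # -> (1, n, 3072)
y = concat_shape(h, (1, 1, 3072), axis=1)    # -> (1, n + 1, 3072)
print(h, y)
```

Once shapes are known in this symbolic form, downstream passes such as operator fusion or memory planning can reason about them without executing the model, which is the role RDP plays in the framework described above.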
“…LogNN [24] and ShiftAddNet [37] do not conduct experiments on large-scale datasets such as ImageNet. S2FP8 [6] and LUQ [8] introduce extra multiplications in the quantization process, which increase the energy consumption as stated in [18].…”
Figure 1. Energy-Accuracy joint comparison. "Accuracy" refers to the accuracy of training ResNet50 on ImageNet from scratch. "Energy Consumption" refers to the energy consumed by MACs for one training iteration of ResNet50 on ImageNet. Note that INQ and ShiftCNN apply their methods to pre-trained models, so their training consumption is the same as that of full-precision training.
“…These data types have large errors for large-magnitude values since they have only 2 bits for the fraction, but provide high accuracy for small-magnitude values. Jin et al. (2022) provide an excellent analysis of when certain fixed-point exponent/fraction bit widths are optimal for inputs with a particular standard deviation. We believe FP8 data types offer superior performance compared to the Int8 data type, but currently, neither GPUs nor TPUs support this data type.…”
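To illustrate the point about fraction bits, here is a rough NumPy sketch, not an exact FP8/E5M2 implementation (it ignores subnormals, saturation, and exponent range), showing that with only 2 stored fraction bits the absolute rounding error grows with magnitude:

```python
import numpy as np

def round_to_2_fraction_bits(x):
    """Keep 2 fraction (mantissa) bits: 1 implicit leading bit + 2 stored bits
    gives 8 representable mantissa steps per binade."""
    m, e = np.frexp(x)           # x = m * 2**e with 0.5 <= |m| < 1
    m_q = np.round(m * 8) / 8
    return np.ldexp(m_q, e)

for v in [0.1, 1.0, 10.0, 100.0]:
    x = np.float32(v * 1.03)
    err = abs(round_to_2_fraction_bits(x) - x)
    print(f"value ~{v:7.2f}  absolute rounding error {err:.4f}")
```

The spacing between representable values doubles with each binade, so small-magnitude inputs are represented accurately while large-magnitude inputs incur proportionally larger absolute errors, matching the observation quoted above.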
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference in half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while still more than 99.9% of values are multiplied in 8-bit. Using LLM.int8(), we show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.
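A simplified NumPy sketch of the two-part idea described in this abstract: vector-wise absmax quantization for most features, with outlier feature dimensions routed through a higher-precision matmul. The threshold, scale handling, and function name are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def llm_int8_matmul_sketch(X, W, outlier_threshold=6.0):
    """Sketch: vector-wise int8 quantization with a 16-bit path
    for outlier feature dimensions (columns of X with large magnitudes)."""
    # 1. Identify outlier feature dimensions of X (assumed threshold).
    outlier_cols = np.any(np.abs(X) > outlier_threshold, axis=0)
    regular_cols = ~outlier_cols

    # 2. Vector-wise absmax quantization: one scale per row of X, one per column of W.
    Xr, Wr = X[:, regular_cols], W[regular_cols, :]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127.0 + 1e-12
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127.0 + 1e-12
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)

    # 3. Int8 matmul accumulated in int32, then dequantized with the
    #    outer product of the row and column scales.
    int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)

    # 4. Outlier dimensions stay in higher precision (fp16 in the paper).
    fp_part = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols, :].astype(np.float16)
    return int8_part + fp_part.astype(np.float32)

X = np.random.randn(4, 64).astype(np.float32)
X[:, 3] *= 10.0                      # make one feature dimension an "outlier"
W = np.random.randn(64, 16).astype(np.float32)
print(np.max(np.abs(llm_int8_matmul_sketch(X, W) - X @ W)))  # small residual error
```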