Abstract: Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce it, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization. This introduces a noticeable cost in terms of memory, speed, and required energy. To tackle these issues, we present…
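The scaling/dequantization cost mentioned in this abstract can be made concrete with a minimal NumPy sketch of standard affine quantization (an illustrative example only, not the paper's proposed method; the per-tensor scaling scheme and variable names are assumptions):

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    """Affine quantization: map float32 values to int8 codes."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8(q, scale, zero_point):
    """Dequantization needs a full-precision multiply by `scale`;
    this rescaling step is the cost the abstract refers to."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4).astype(np.float32)
scale = (x.max() - x.min()) / 255.0            # per-tensor scale (assumed scheme)
zero_point = np.round(-x.min() / scale) - 128  # map x.min() to -128
q = quantize_int8(x, scale, zero_point)
x_hat = dequantize_int8(q, scale, zero_point)
print(q, x_hat)
```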
“…This poses a significant challenge for mobile devices in terms of computation and resource requirements. Our future work will enhance SoD² by combining it with the model pruning and quantization advances [27,47,64] to achieve even better performance. Extending beyond ONNX.…”
Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes, and even the set of operators used, are dependent upon the input and/or execution, are becoming common. This paper presents SoD², a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification towards a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains. Our evaluation results show that SoD² runs up to 3.9× faster than these systems while saving up to 88% peak memory consumption.
CCS CONCEPTS: • Computing methodologies → Neural networks; • Software and its engineering → Source code generation; • Human-centered computing → Mobile computing.
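The idea of statically propagating shapes as known constants or symbolic constants can be sketched as follows. This is a simplified illustration under assumed operator rules, not SoD²'s actual RDP implementation; the function names are hypothetical:

```python
import sympy as sp

# Dimensions are either known integers or symbolic constants (e.g. sequence length n).
n = sp.Symbol("n", positive=True, integer=True)

def matmul_shape(a_shape, b_shape):
    """Propagate the output shape of A @ B; dims may be ints or symbols."""
    *batch, m, k1 = a_shape
    k2, p = b_shape[-2], b_shape[-1]
    assert sp.simplify(k1 - k2) == 0, "inner dimensions must agree"
    return (*batch, m, p)

def concat_shape(a_shape, b_shape, axis):
    """Concatenation adds the dimensions along `axis`, possibly symbolically."""
    out = list(a_shape)
    out[axis] = sp.simplify(a_shape[axis] + b_shape[axis])
    return tuple(out)

# A dynamic sequence of symbolic length n flows through two operators:
h = matmul_shape((1, n, 768), (768, 3072))   # -> (1, n, 3072)
y = concat_shape(h, (1, 1, 3072), axis=1)    # -> (1, n + 1, 3072)
print(h, y)
```

Once shapes are known in this symbolic form, downstream passes such as operator fusion or memory planning can reason about them without executing the model, which is the role RDP plays in the framework described above.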
“…LogNN [24] and ShiftAddNet [37] do not conduct experiments on large-scale datasets such as ImageNet. S2FP8 [6] and LUQ [8] introduce extra multiplications in the quantization process, which increase the energy consumption as stated in [18].…”
Figure 1. Energy-Accuracy joint comparison. "Accuracy" refers to the accuracy of training ResNet50 on ImageNet from scratch. "Energy Consumption" refers to the energy consumed by MACs for one training iteration of ResNet50 on ImageNet. Note that INQ and ShiftCNN apply their methods to pre-trained models, so their training consumption is the same as that of full-precision training.
“…These data types have large errors for large-magnitude values since they have only 2 bits for the fraction, but provide high accuracy for small-magnitude values. Jin et al. (2022) provide an excellent analysis of when certain fixed-point exponent/fraction bit widths are optimal for inputs with a particular standard deviation. We believe FP8 data types offer superior performance compared to the Int8 data type, but currently, neither GPUs nor TPUs support this data type.…”
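To illustrate the point about fraction bits, here is a rough NumPy sketch, not an exact FP8/E5M2 implementation (it ignores subnormals, saturation, and exponent range), showing that with only 2 stored fraction bits the absolute rounding error grows with magnitude:

```python
import numpy as np

def round_to_2_fraction_bits(x):
    """Keep 2 fraction (mantissa) bits: 1 implicit leading bit + 2 stored bits
    gives 8 representable mantissa steps per binade."""
    m, e = np.frexp(x)           # x = m * 2**e with 0.5 <= |m| < 1
    m_q = np.round(m * 8) / 8
    return np.ldexp(m_q, e)

for v in [0.1, 1.0, 10.0, 100.0]:
    x = np.float32(v * 1.03)
    err = abs(round_to_2_fraction_bits(x) - x)
    print(f"value ~{v:7.2f}  absolute rounding error {err:.4f}")
```

The spacing between representable values doubles with each binade, so small-magnitude inputs are represented accurately while large-magnitude inputs incur proportionally larger absolute errors, matching the observation quoted above.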
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference in half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while still more than 99.9% of values are multiplied in 8-bit. Using LLM.int8(), we show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.
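A simplified NumPy sketch of the two-part idea described in this abstract: vector-wise absmax quantization for most features, with outlier feature dimensions routed through a higher-precision matmul. The threshold, scale handling, and function name are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def llm_int8_matmul_sketch(X, W, outlier_threshold=6.0):
    """Sketch: vector-wise int8 quantization with a 16-bit path
    for outlier feature dimensions (columns of X with large magnitudes)."""
    # 1. Identify outlier feature dimensions of X (assumed threshold).
    outlier_cols = np.any(np.abs(X) > outlier_threshold, axis=0)
    regular_cols = ~outlier_cols

    # 2. Vector-wise absmax quantization: one scale per row of X, one per column of W.
    Xr, Wr = X[:, regular_cols], W[regular_cols, :]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127.0 + 1e-12
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127.0 + 1e-12
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)

    # 3. Int8 matmul accumulated in int32, then dequantized with the
    #    outer product of the row and column scales.
    int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)

    # 4. Outlier dimensions stay in higher precision (fp16 in the paper).
    fp_part = X[:, outlier_cols].astype(np.float16) @ W[outlier_cols, :].astype(np.float16)
    return int8_part + fp_part.astype(np.float32)

X = np.random.randn(4, 64).astype(np.float32)
X[:, 3] *= 10.0                      # make one feature dimension an "outlier"
W = np.random.randn(64, 16).astype(np.float32)
print(np.max(np.abs(llm_int8_matmul_sketch(X, W) - X @ W)))  # small residual error
```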