Abstract: Quantization techniques applied to the inference of deep neural networks have enabled fast and efficient execution on resource-constrained devices. The success of quantization during inference has motivated the academic community to explore fully quantized training, i.e., quantizing backpropagation as well. However, effective gradient quantization is still an open problem. Gradients are unbounded and their distribution changes significantly during training, which leads to the need for dynamic quantization. As we…
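To make the "dynamic quantization" requirement above concrete, here is a minimal NumPy sketch, assuming a symmetric signed 4-bit integer grid (an illustrative choice, not a format taken from the paper), in which the quantization range is re-measured from the gradient tensor at every iteration; this per-iteration statistics pass is exactly the overhead discussed in the citing passages below.

import numpy as np

def dynamic_quantize(grad, num_bits=4):
    # Re-measure the range from the current tensor on every call ("on the fly").
    qmax = 2 ** (num_bits - 1) - 1         # e.g. 7 for a signed 4-bit grid
    scale = np.abs(grad).max() / qmax      # per-iteration statistic read from the tensor
    if scale == 0.0:
        return np.zeros_like(grad)
    q = np.clip(np.round(grad / scale), -qmax - 1, qmax)
    return q * scale                       # dequantized ("fake-quantized") gradient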
“…Also, Sun et al. (2019) presented a novel hybrid format for full training in FP8: while the weights and activations are quantized to the [1,4,3] format, the neural gradients are quantized to the [1,5,2] format to capture a wider dynamic range. Fournarakis & Nagel (2021) suggested a method to reduce the data traffic during the calculation of the quantization range.…”
Section: Related Work (mentioning)
confidence: 99%
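As a rough illustration of why the citing authors describe [1,5,2] as capturing a wider dynamic range than [1,4,3], the snippet below compares the normal-number ranges of the two (sign, exponent, mantissa) layouts, assuming an IEEE-like exponent bias of 2^(e-1) - 1; the exact bias and special-value conventions of Sun et al. (2019) are not reproduced.

def fp_range(exp_bits, man_bits):
    # Smallest and largest normal magnitudes of a (1, exp_bits, man_bits) float layout.
    bias = 2 ** (exp_bits - 1) - 1
    smallest_normal = 2.0 ** (1 - bias)
    largest = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** ((2 ** exp_bits - 2) - bias)
    return smallest_normal, largest

for name, (e, m) in {"[1,4,3]": (4, 3), "[1,5,2]": (5, 2)}.items():
    lo, hi = fp_range(e, m)
    print(f"{name}: smallest normal ~{lo:.1e}, largest ~{hi:.1e}")
# Under these assumptions: [1,4,3] spans roughly 1.6e-02 .. 2.4e+02, while
# [1,5,2] spans roughly 6.1e-05 .. 5.7e+04, i.e. the extra exponent bit buys
# several orders of magnitude of range at the cost of one mantissa bit.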
“…In addition, this issue might be solved with dedicated hardware, such as a unit that calculates the statistics more efficiently, or by using memory-on-chip blocks that reduce the data-movement overhead. A recent method (Fournarakis & Nagel, 2021) tries to reduce the data movement by using the previous iterations' statistics, but as shown in Fig. 5a in the appendix, combining it with LUQ causes accuracy degradation.…”
Section: Future Directions (mentioning)
confidence: 99%
“…This statistics measurement increases the data movement to and from memory, similarly to previously suggested methods (Sun et al., 2020). Recently, Fournarakis & Nagel (2021) suggested an "in-hindsight" method to reduce the data-movement overhead of computing the quantization ranges on the fly, by using a running average of the statistics from previous iterations. In Fig.…”
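The quoted passage describes the in-hindsight idea only at a high level; below is a minimal sketch of that idea, assuming a running maximum of the absolute value and a momentum of 0.9 (both illustrative choices, not the exact update rule of Fournarakis & Nagel, 2021). The current tensor is quantized with a range carried over from earlier iterations, and the statistic is updated afterwards, so no separate range-finding pass over the tensor is needed before quantizing it.

import numpy as np

class InHindsightRange:
    def __init__(self, momentum=0.9, init_scale=1.0):
        self.momentum = momentum
        self.running_max = init_scale          # running estimate of max |x|

    def quantize(self, x, num_bits=4):
        qmax = 2 ** (num_bits - 1) - 1
        scale = self.running_max / qmax        # range decided from past iterations
        q = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
        # Update the statistic from the tensor just processed; the new value
        # only affects the range used at the next iteration.
        self.running_max = (self.momentum * self.running_max
                            + (1 - self.momentum) * np.abs(x).max())
        return q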
Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Network (DNN) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. In this work, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how. Based on this, we suggest a logarithmic unbiased quantization (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.18%. We further improve this to a degradation of only 0.64% after a single epoch of high-precision fine-tuning combined with a variance reduction method; both add overhead comparable to previously suggested methods. Finally, we suggest a method that uses the low-precision format to avoid multiplications during two-thirds of the training process, thus reducing the area used by the multiplier by 5x.
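The abstract above only names the LUQ idea; the sketch below illustrates the unbiasedness property it relies on, by rounding each magnitude stochastically to one of the two neighboring powers of two so that E[q(x)] = x inside the representable range. The number of exponent bins, the per-tensor scaling, and the crude underflow clamp are assumptions made for illustration; the paper's exact 4-bit format and underflow handling are not reproduced here.

import numpy as np

def log_quantize_unbiased(x, num_levels=8, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-tensor alignment: place the largest magnitude in the top exponent bin.
    max_exp = np.ceil(np.log2(max(mag.max(), 1e-30)))
    min_exp = max_exp - (num_levels - 1)
    mag = np.clip(mag, 2.0 ** min_exp, 2.0 ** max_exp)   # crude under/overflow clamp
    lo = np.floor(np.log2(mag))                          # lower neighboring power of two
    # Round up with probability p chosen so the expectation is preserved:
    # (1 - p) * 2^lo + p * 2^(lo + 1) = mag  =>  p = mag / 2^lo - 1.
    p_up = mag / (2.0 ** lo) - 1.0
    q = 2.0 ** (lo + (rng.random(mag.shape) < p_up))
    return sign * q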