Training High-Performance and Large-Scale Deep Neural Networks with Full 8-bit Integers
Preprint, 2019
DOI: 10.48550/arxiv.1909.02384
Cited by 6 publications (6 citation statements)
References 18 publications
“…DFQ unifies two efficient DNN training mindsets, i.e., dynamic selective layer update and static low-precision training, and enables a "fractional" quantization of layers during training, in contrast to either a full execution (selected) or complete non-execution (bypassed) of layers. Furthermore, DFQ introduces input-adaptive quantization at training for the first time, and automatically learns to adapt the precision of different layers' activations and gradients in contrast to current practice of low-precision training [14,22,37] that fixes layer-wise precision during training regardless of inputs.…”
Section: Design of PFQ (mentioning)
confidence: 99%
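To make the quoted idea of input-adaptive, per-layer precision concrete, here is a minimal NumPy sketch. It only illustrates the general pattern of picking a bit-width per layer from input statistics and then fake-quantizing the activations; the names (`fake_quantize`, `choose_bits`), the variance-based rule, and the threshold are illustrative assumptions, not the actual DFQ gating mechanism described in the citing paper.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Symmetric per-tensor quantization to num_bits, then dequantize back."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12          # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def choose_bits(x, low_bits=4, high_bits=8, threshold=1.0):
    """Toy input-adaptive rule: low-variance ('easy') inputs get fewer bits."""
    return low_bits if np.var(x) < threshold else high_bits

# Per-layer, per-input precision selection over a stack of activations.
activations = [np.random.randn(64, 128) * s for s in (0.5, 1.5, 3.0)]
for i, act in enumerate(activations):
    bits = choose_bits(act)
    act_q = fake_quantize(act, bits)
    print(f"layer {i}: {bits}-bit, max abs error {np.max(np.abs(act - act_q)):.4f}")
```

A learned controller (rather than this fixed variance rule) would make the precision choice differentiable and trainable, which is the contrast the quote draws against fixed layer-wise precision schemes.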
“…We next evaluate FracTrain over three SOTA low-precision training baselines including SBM [14], DoReFa [37], and WAGEUBN [22]. Here we consider standard training settings.…”
Section: FracTrain over SOTA Low-Precision Training (mentioning)
confidence: 99%
“…Recent works have explored training DNNs with reduced precisions in floating-point arithmetic domain such as bfloat16 [40], float8 [41] as well as fixed-point arithmetic domain [13], [42]. While floating-point arithmetic is not amenable to ReRAM-based hardware (without modifications), the reductions in fixed-point precision can be exploited in PANTHER by reducing the MCU width (number of slices) to improve training energy and time.…”
Section: Related Work (mentioning)
confidence: 99%
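To make the fixed-point setting concrete, here is a minimal NumPy sketch of an 8-bit integer forward pass: int8 operands, int32 accumulation, and a single floating-point rescale at the end. It illustrates the generic int8/fixed-point arithmetic that the cited paper and these related works target; the function name, the symmetric per-tensor scaling, and the error check are assumptions for illustration, not the WAGEUBN or PANTHER implementations.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float array to int8 plus a scale."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# One linear layer evaluated in the integer domain.
x = np.random.randn(32, 256).astype(np.float32)   # activations
w = np.random.randn(256, 64).astype(np.float32)   # weights

xq, sx = quantize_int8(x)
wq, sw = quantize_int8(w)

acc = xq.astype(np.int32) @ wq.astype(np.int32)   # integer matmul, int32 accumulator
y = acc.astype(np.float32) * (sx * sw)            # dequantize the result once

ref = x @ w
print("relative error:", np.linalg.norm(y - ref) / np.linalg.norm(ref))
```

Keeping the accumulator in int32 while operands stay in int8 is what lets such schemes map onto narrow fixed-point datapaths (fewer slices, in PANTHER's terms) without overflow for typical layer sizes.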