2020
DOI: 10.48550/arxiv.2001.00705
Preprint

Fractional Skipping: Towards Finer-Grained Dynamic CNN Inference

Abstract: While increasingly deep networks are still in general desired for achieving state-of-the-art performance, for many specific inputs a simpler network might already suffice. Existing works exploited this observation by learning to skip convolutional layers in an input-dependent manner. However, we argue their binary decision scheme, i.e., either fully executing or completely bypassing one layer for a specific input, can be enhanced by introducing finer-grained, softer decisions. We therefore propose a Dynamic Fr…
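The abstract is truncated, but the core idea it describes, replacing a binary execute-or-skip decision per layer with softer, finer-grained choices, can be illustrated with a small hypothetical PyTorch sketch. The gating design, module names, and bit-width options below are illustrative assumptions, not the paper's actual Dynamic Fractional Skipping implementation.

```python
# Hypothetical sketch of finer-grained, input-dependent layer skipping:
# instead of a binary keep/skip gate, a small gating head picks a
# quantization level for each residual block (0 bits = skip entirely).
# Names and the gating design are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform fake-quantization of activations to `bits` bits."""
    if bits >= 32:                      # treat 32 as "full precision"
        return x
    scale = x.abs().max().clamp(min=1e-8)
    levels = 2 ** bits - 1
    return torch.round(x / scale * levels) / levels * scale


class FractionalSkipBlock(nn.Module):
    """Residual block whose conv path runs at an input-chosen bit-width."""

    def __init__(self, channels: int, bit_options=(0, 4, 8, 32)):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bit_options = bit_options
        # Tiny gate: global-average-pooled features -> one logit per option.
        self.gate = nn.Linear(channels, len(bit_options))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x.mean(dim=(2, 3)))   # (N, num_options)
        choice = logits.argmax(dim=1)            # hard per-input choice
        out = torch.zeros_like(x)
        for i, bits in enumerate(self.bit_options):
            mask = choice == i
            if bits == 0 or not mask.any():      # 0 bits = skip the conv path
                continue
            xi = quantize(x[mask], bits)
            out[mask] = F.relu(self.conv(xi))
        return x + out                           # residual connection


if __name__ == "__main__":
    block = FractionalSkipBlock(channels=16)
    y = block(torch.randn(2, 16, 8, 8))
    print(y.shape)  # torch.Size([2, 16, 8, 8])
```

In practice the hard argmax gate would have to be trained with a differentiable relaxation (e.g., Gumbel-softmax) and a computation-cost penalty; that is beyond the scope of this sketch.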

Cited by 2 publications (4 citation statements)
References 19 publications (4 reference statements)
“…A systematic approach to find the correct precision for each layer has been shown in (Wang et al., 2019; Dong et al., 2019; Cai et al., 2020). Dynamic multi-granularity for tensors is also considered as a way of computation saving (Shen et al., 2020). Several quantization schemes have been proposed for training (Wu et al., 2018b; Banner et al., 2018; Das et al., 2018; De Sa et al., 2018; Park et al., 2018).…”
Section: Related Work
confidence: 99%
“…Therefore, mixed-precision DNN accelerators that support versatility in data types are crucial and sometimes mandatory to exploit the benefit of different software optimizations (e.g., low-bit quantization). Moreover, supporting versatility in data types can be leveraged to trade off accuracy for efficiency based on the available resources (Shen et al., 2020). Typically, mixed-precision accelerators are designed based on low-precision arithmetic units, and higher-precision operation can be supported by fusing the low-precision arithmetic units temporally or spatially.…”
Section: Introduction
confidence: 99%
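The "fusing low-precision arithmetic units temporally or spatially" point can be made concrete with a minimal sketch, assuming unsigned 8-bit operands and 4-bit multiplier units; it is an illustration, not the design of any particular accelerator.

```python
# A minimal sketch of "spatially fusing" low-precision units: an 8-bit x 8-bit
# unsigned multiply built from four 4-bit x 4-bit partial products, each of
# which a low-precision multiplier array could compute in parallel.

def mul4(a: int, b: int) -> int:
    """Stand-in for a 4-bit x 4-bit hardware multiplier (operands in [0, 15])."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b

def fused_mul8(a: int, b: int) -> int:
    """8-bit x 8-bit unsigned multiply composed from 4-bit partial products."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return ((mul4(a_hi, b_hi) << 8)
            + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
            + mul4(a_lo, b_lo))

# Sanity check against a plain multiply over the full 8-bit range.
assert all(fused_mul8(a, b) == a * b for a in range(256) for b in range(256))
```

A temporal fusion would instead reuse a single 4-bit unit over four cycles and accumulate the shifted partial products.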
“…Inspired by [37] and following [30], we calculate the computational cost of DNNs using the effective number of MACs, i.e., (# of MACs) × Bit_a/32 × Bit_b/32 for a dot product between a and b, where Bit_a and Bit_b denote the precision of a and b, respectively. As such, this metric is proportional to the total number of bit operations.…”
Section: Design of PFQ
confidence: 99%
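The quoted cost metric is straightforward to reproduce; below is a minimal sketch that applies it to an example convolution layer. The layer dimensions and helper names are illustrative, not taken from the citing paper.

```python
# Effective-MAC metric quoted above: the raw MAC count of a layer is scaled
# by Bit_a/32 and Bit_b/32, so the result is proportional to the number of
# bit operations. Layer shapes below are just an example.

def conv2d_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Raw multiply-accumulate count of a k x k convolution layer."""
    return c_in * c_out * k * k * h_out * w_out

def effective_macs(num_macs: int, bits_a: int, bits_b: int) -> float:
    """Effective MACs = (# of MACs) * Bit_a/32 * Bit_b/32."""
    return num_macs * (bits_a / 32) * (bits_b / 32)

macs = conv2d_macs(c_in=64, c_out=64, k=3, h_out=56, w_out=56)
print(effective_macs(macs, bits_a=8, bits_b=8))    # 8-bit weights and activations
print(effective_macs(macs, bits_a=32, bits_b=32))  # full precision: equals raw MACs
```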
“…Dynamic/efficient DNN training. More recently, dynamic inference [23, 9, 24, 25, 26, 27, 28, 29] was developed to reduce the average inference cost, and was then extended to the most fine-grained bit level [30, 31]. While energy-efficient training is more complicated than, and different from, inference, many insights from the latter can be lent to the former.…”
Section: Introduction / Prior Work
confidence: 99%