Invited: Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications

Kwon, Kiseok; Amid, Alon; Gholami, Amir; Wu, Bichen; Asanović, Krste; Keutzer, Kurt

doi:10.1109/dac.2018.8465901

Cited by 16 publications

(9 citation statements)

References 8 publications


“…The choice of dataflow for a systolic architecture has a significant impact on its performance and energy-efficiency [7,37]. Since FuSeConv is a systolic algorithm, avoids im2col transformations, and is entirely composed of 1D convolutions, it is at an advantage in being mapped to systolic arrays.…”

Section: Systolic-array With ST-OS Dataflow (mentioning)

confidence: 99%

Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge

Ganesan

Kumar

2022

ACM Trans. Embed. Comput. Syst.


Massively parallel systolic arrays and resource-efficient depthwise separable convolutions are two promising hardware and software techniques to accelerate DNN inference on the edge. Interestingly, their combination is inefficient: Computational patterns of depthwise separable convolutions do not exhibit a rhythmic systolic flow and lack sufficient data reuse to saturate systolic arrays. In this paper, we formally analyse this inefficiency and propose an efficient operator, an optimal hardware dataflow, and a superior training methodology towards alleviating this. The efficient operator, called Fully-Separable Convolutions (FuSeConv), is a drop-in replacement for depthwise-separable convolutions. FuSeConv generalizes factorization of convolution fully along their spatial and depth dimensions. The resultant computation is systolic and efficiently maps to systolic arrays. The optimal hardware dataflow, called Spatial-Tiled Output Stationary (ST-OS), maximizes the efficiency of FuSeConv on systolic arrays. It maps independent convolutions to rows of the systolic array to maximize resource-utilization with negligible VLSI overheads. Neural Operator Scaffolding (NOS) scaffolds the training of FuSeConv operators by distilling knowledge from the more expensive depthwise separable convolution operation. This bridges the accuracy gap between FuSeConv networks and networks with depthwise-separable convolutions. Additionally, NOS can be combined with Neural Architecture Search (NAS) to trade off latency and accuracy. The hardware-software co-design of FuSeConv with ST-OS achieves a significant speedup of 4.1–9.25× with state-of-the-art efficient networks for the ImageNet dataset. The parameter efficiency of FuSeConv and its significant superiority over depthwise-separable convolutions on systolic arrays illustrate their promise as a strong solution on the edge.
Training FuSeConv networks with NOS achieves accuracy comparable to the depthwise-separable convolution baselines. Further, by combining NOS with NAS, we design networks that define state-of-the-art models improving on both accuracy and latency for computer vision on systolic arrays.
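
To give a rough sense of the parameter efficiency the abstract refers to, the sketch below counts weights for a standard K×K convolution, a depthwise-separable convolution, and a hypothetical fully separable variant built only from 1D depthwise filters. The layout assumed for the third case (one 1×K and one K×1 filter per input channel, followed by a 1×1 pointwise layer) and all function names are illustrative assumptions; the abstract does not give the exact FuSeConv definition, so this should not be read as the authors' operator.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a dense KxK convolution (bias terms ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """One KxK filter per input channel, then a 1x1 pointwise convolution."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


def fully_separable_params(c_in: int, c_out: int, k: int) -> int:
    """ASSUMPTION (not taken from the paper): each input channel gets one
    1xK and one Kx1 filter, followed by a 1x1 pointwise convolution."""
    one_d_filters = c_in * 2 * k
    pointwise = c_in * c_out
    return one_d_filters + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    c_in, c_out, k = 32, 64, 3
    for name, fn in [
        ("standard", standard_conv_params),
        ("depthwise-separable", depthwise_separable_params),
        ("fully separable (assumed layout)", fully_separable_params),
    ]:
        print(f"{name}: {fn(c_in, c_out, k)} weights")
```

Under these assumptions the pointwise layer dominates the count, so the saving from 1D filters grows with the kernel size K rather than with the channel count.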


“…a) Efficient Neural Network: Several different approaches to reduce the memory footprint, latency, and power of modern neural network (NN) architectures. These techniques can be broadly categorized into (1) model pruning [18,31,35,38,40,67], (2) knowledge distillation [21,39,43,49,70], (3) efficient neural architecture design [23,24,37,51,57], (4) hardware and neural architecture co-design [16,17,22,29,64], and (5) quantization [5,7,8,14,15,27,34,48,60,66,72,73].…”

Section: Related Work (mentioning)

confidence: 99%

I-BERT: Integer-only BERT Quantization

Kim

Gholaminejad

Yao

et al. 2021

Preprint

Self Cite


Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for many edge processors, and it has been a challenge to deploy these models for edge applications and devices that have resource constraints. While quantization can be a viable solution to this, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference, thus limiting model deployment on many edge processors. In this work, we propose a novel integer-only quantization scheme for Transformer-based models that quantizes the entire inference process. In particular, we demonstrate how to approximate nonlinear operations in Transformer architectures, e.g., GELU, Softmax, and Layer Normalization, with lightweight integer computations. We use those approximations in our method, I-BERT, with an end-to-end integer-only inference, and without any floating-point calculation. We test our approach on GLUE downstream tasks using RoBERTa-Base and RoBERTa-Large. For both cases, with an 8-bit integer-only quantization scheme, I-BERT achieves similar accuracy as compared to the full-precision baseline.
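
As background for the 8-bit scheme mentioned above, the following is a minimal sketch of generic symmetric 8-bit quantization applied to a dot product, where the accumulation itself uses only integers. It is not I-BERT's method: the abstract does not describe its integer approximations of GELU, Softmax, or Layer Normalization, and none are attempted here. The function names and example vectors are assumptions for illustration.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude onto the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer step and clamp to [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def int_dot(qa, qb):
    """Dot product computed purely on integers."""
    return sum(a * b for a, b in zip(qa, qb))


if __name__ == "__main__":
    # Hypothetical activation and weight vectors.
    a = [1.0, -0.5, 0.25]
    b = [0.5, 0.25, -1.0]
    scale_a, scale_b = symmetric_scale(a), symmetric_scale(b)
    qa, qb = quantize(a, scale_a), quantize(b, scale_b)
    acc = int_dot(qa, qb)            # integer accumulator
    approx = acc * scale_a * scale_b  # rescale once, after accumulation
    exact = sum(x * y for x, y in zip(a, b))
    print(f"integer accumulator: {acc}")
    print(f"rescaled result: {approx:.4f} (exact {exact:.4f})")
```

In this sketch the only floating-point step is the final rescaling; a fully integer pipeline would have to replace that step as well, which is the harder problem the abstract says I-BERT addresses.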


“…The choice of dataflow for a systolic architecture has a significant impact on its performance and energy-efficiency [7,35].…”

Section: Systolic-array With ST-OS Dataflow (mentioning)

confidence: 99%

Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge

Ganesan

Kumar

2021

Preprint


Massively parallel systolic arrays and resource-efficient depthwise separable convolutions are two promising hardware and software techniques to accelerate DNN inference on the edge. Interestingly, their combination is inefficient: Computational patterns of depthwise separable convolutions do not exhibit a rhythmic systolic flow and lack sufficient data reuse to saturate systolic arrays. In this paper, we formally analyse this inefficiency and propose an efficient operator, an optimal hardware dataflow, and a superior training methodology towards alleviating this. The efficient operator, called Fully-Separable Convolutions (FuSeConv), is a drop-in replacement for depthwise-separable convolutions. FuSeConv generalizes factorization of convolution fully along their spatial and depth dimensions. The resultant computation is systolic and efficiently maps to systolic arrays. The optimal hardware dataflow, called Spatial-Tiled Output Stationary (ST-OS), maximizes the efficiency of FuSeConv on systolic arrays. It maps independent convolutions to rows of the systolic array to maximize resource-utilization with negligible VLSI overheads. Neural Operator Scaffolding (NOS) scaffolds the training of FuSeConv operators by distilling knowledge from the more expensive depthwise separable convolution operation. This bridges the accuracy gap between FuSeConv networks and networks with depthwise-separable convolutions. Additionally, NOS can be combined with Neural Architecture Search (NAS) to trade off latency and accuracy. The hardware-software co-design of FuSeConv with ST-OS achieves a significant speedup of 4.1–9.25× with state-of-the-art efficient networks for the ImageNet dataset. The parameter efficiency of FuSeConv and its significant out-performance over depthwise-separable convolutions on systolic arrays illustrate their promise as a strong solution on the edge.
Training FuSeConv networks with NOS achieves accuracy comparable to the depthwise-separable convolution baselines. Further, by combining NOS with NAS, we design networks that define state-of-the-art models improving on both accuracy and latency for computer vision on systolic arrays.
