2020
DOI: 10.1109/jssc.2020.3005786

An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices

Cited by 37 publications (12 citation statements)
References 18 publications
“…All the coded indexes are stored in the Para-Buffer, and only the indexes need to be loaded into the AF and pooling process engines in the BP phase. The index coding reduces storage by 50% compared to the design proposed in Ref. [12].…”
Section: A. Computation and Dataflow
confidence: 94%
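
The technique the statement describes is generic to max pooling: the backward pass only needs to know which input won each window, not the activation values themselves, so buffering a 2-bit index per 2x2 window (rather than the full values a baseline design stores) is what yields the quoted 50% saving. A minimal NumPy sketch of index-coded pooling, with illustrative names that are not taken from the paper:

import numpy as np

def maxpool2x2_forward(x):
    # Flatten each non-overlapping 2x2 window into a row of four values.
    h, w = x.shape
    win = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = win.argmax(axis=1).astype(np.uint8)  # 2-bit winner index per window
    out = win[np.arange(win.shape[0]), idx].reshape(h // 2, w // 2)
    return out, idx  # only `idx` must be kept around for the BP phase

def maxpool2x2_backward(grad_out, idx, shape):
    # Route each output gradient to the winning input position using
    # nothing but the stored indexes.
    h, w = shape
    flat = np.zeros((grad_out.size, 4), dtype=grad_out.dtype)
    flat[np.arange(grad_out.size), idx] = grad_out.ravel()
    return flat.reshape(h // 2, w // 2, 2, 2).transpose(0, 2, 1, 3).reshape(h, w)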
“…Our processor achieves an energy efficiency of 2.44 TOPS/W for inference and 1.36 TOPS/W for training. Compared to the latest training-dedicated processor, Ref. [12], our processor yields a 1.32× energy-efficiency improvement for training. Compared to the latest inference/training processor, Ref. [11], our design achieves a 2.1× energy-efficiency improvement for training and 1.09× for inference.…”
Section: B. Energy Efficiency Comparison
confidence: 96%
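
The quoted ratios let a reader back out the baseline figures; the derived numbers below are computed from the statement, not quoted from either referenced paper:

ours_train, ours_infer = 1.36, 2.44      # TOPS/W, as quoted above
ref12_train = ours_train / 1.32          # ≈ 1.03 TOPS/W for Ref. [12], training
ref11_train = ours_train / 2.1           # ≈ 0.65 TOPS/W for Ref. [11], training
ref11_infer = ours_infer / 1.09          # ≈ 2.24 TOPS/W for Ref. [11], inference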
“…Previous work on accelerating DNN training has focused on leveraging the sparsity present in weights and activations [11], [33], [44], [45]. TensorDash [33] accelerates the DNN training process while achieving higher energy efficiency by eliminating the ineffectual operations resulting from the sparse input data.…”
Section: Accelerators for DNN Training
confidence: 99%
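
The sparsity argument here is that a multiply-accumulate with a zero weight or zero activation contributes nothing to the result, so a training accelerator can skip it. A toy Python model of the effectual-work count (illustrative only; this is not TensorDash's actual operand scheduler):

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * (rng.random(4096) < 0.4)  # ~60% zero weights
a = np.maximum(rng.standard_normal(4096), 0.0)            # ReLU leaves ~50% zeros

dense_macs = w.size                                     # what a dense engine performs
effectual_macs = np.count_nonzero((w != 0) & (a != 0))  # what a sparse engine must do
print(f"{effectual_macs}/{dense_macs} MACs are effectual "
      f"({effectual_macs / dense_macs:.0%} of dense work)")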
“…In terms of computing, a low-power CPU will be used for initiating data movement, performing data preprocessing tasks (normalization and dimensionality reduction), and invoking a specialized MLP accelerator. The accelerator is assumed to be able to support high-throughput, low-latency inference as well as on-device training of MLPs, similar to some of the recent advances (Choi et al., 2020). This paper introduces the lightweight MLP-LIBS-ADAPT for portable and remote LIBS systems, which can also adapt to any domain shift in a semi-supervised manner.…”
Section: Accelerator Design for LIBS
confidence: 99%
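
The division of labor in the quoted design is conventional: the host CPU normalizes and reduces the spectrum, then hands a small vector to the MLP engine. A minimal sketch of that hand-off, where every name (the projection matrix, the layer shapes, mlp_infer standing in for the accelerator call) is hypothetical:

import numpy as np

def preprocess(spectrum, proj):
    # Host-CPU side: normalize, then project to a low-dimensional input.
    x = (spectrum - spectrum.mean()) / (spectrum.std() + 1e-8)
    return proj @ x

def mlp_infer(x, W1, b1, W2, b2):
    # Stand-in for the accelerator invocation: one hidden ReLU layer.
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

rng = np.random.default_rng(1)
spectrum = rng.random(2048)                    # raw LIBS spectral channels
proj = rng.standard_normal((64, 2048)) / 45.0  # 2048 -> 64 reduction
W1, b1 = 0.1 * rng.standard_normal((32, 64)), np.zeros(32)
W2, b2 = 0.1 * rng.standard_normal((8, 32)), np.zeros(8)
logits = mlp_infer(preprocess(spectrum, proj), W1, b1, W2, b2)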