2020
DOI: 10.1109/tcad.2019.2930577
DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-Based CNN Accelerators

Abstract: The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by custom instruction set architectures (ISAs) achieve a balance between generality and efficiency, but much about them remains to be optimized. We propose the full-stack compiler DNNVM, which is an integration of optimizers for graphs, loops and data layouts, and an a…


Cited by 60 publications (18 citation statements) | References 33 publications
“…To make the lightweight processing suitable for on-device DCNN processing, the proposed DCNN architecture is further optimized by applying the quantization method [48] to represent each network parameter with an 8-bit fixed-point number. In addition, the layer-fusing method from [49] is used to merge two adjacent processing layers, one convolution layer and the following pooling layer, into a single processing layer with fewer parameters. Table 5 compares the proposed DCNN architecture with the previous method from [43], which provides the smallest model size among existing works, as summarized in Table 3.…”
Section: Proposed Power Optimization Methods
Mentioning confidence: 99%
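The 8-bit fixed-point quantization mentioned above can be sketched as follows. This is an illustrative per-tensor sketch, not the cited method [48], which selects fractional lengths per layer dynamically during training; the function name and example weights are hypothetical.

```python
import numpy as np

def quantize_fixed_point(weights, total_bits=8):
    """Quantize float weights to signed fixed-point, choosing a fractional
    length that covers the tensor's dynamic range (illustrative sketch)."""
    # Integer bits needed for the largest magnitude (one bit reserved for sign).
    max_abs = np.max(np.abs(weights))
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))))
    frac_bits = total_bits - 1 - int_bits   # remaining bits hold the fraction
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(weights * scale),
                -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1)
    return q.astype(np.int8), frac_bits

w = np.array([0.75, -0.5, 0.1234, -0.9])
q, frac = quantize_fixed_point(w)
print(q, frac)           # int8 codes and the fractional length (7 here)
print(q / 2.0 ** frac)   # dequantized approximation of w
```

Each parameter is stored as one byte; the fractional length is the only extra metadata, which is what makes the scheme attractive for on-device inference.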
“…However, depthwise separable convolution spends 95% of its computation time in Conv 1×1, which causes a large MAdds gap between two consecutive layers (Conv 1×1 and Conv DW 3×3) [12]. This gap is unfriendly to embedded systems that load all weights of the network to perform convolution [24]: embedded systems need extra buffers for Conv 1×1.…”
Section: Variable Group Convolution
Mentioning confidence: 99%
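The MAdds gap described above can be reproduced with a quick count. The shapes below are hypothetical, chosen only to show that the pointwise 1×1 stage dominates whenever the channel count is much larger than the kernel area:

```python
def madds_depthwise_block(h, w, c_in, c_out, k=3):
    """Multiply-adds for the two stages of a depthwise-separable block."""
    dw = h * w * c_in * k * k    # depthwise k x k: one filter per channel
    pw = h * w * c_in * c_out    # pointwise 1x1: full channel mixing
    return dw, pw

dw, pw = madds_depthwise_block(h=28, w=28, c_in=256, c_out=256)
print(pw / (dw + pw))   # fraction of MAdds spent in Conv 1x1: ~0.97
```

With 256 channels the 1×1 stage accounts for over 95% of the block's multiply-adds, consistent with the figure quoted in the citation statement.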
“…Communication between off-chip memory and on-chip memory happens only at the start and the end of block computing when a block is grouped and computed together on embedded systems [24]. To limit the communication cost, VarGNet sets the number of output channels to be the same as the number of input channels in the normal block.…”
Section: Blocks Of Variable Group Network
Mentioning confidence: 99%
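A rough model of that communication pattern: when a whole block is computed on-chip, off-chip traffic consists only of loading inputs and weights at the start and storing outputs at the end. The shapes and 8-bit data width below are assumptions for illustration:

```python
def offchip_traffic_bytes(h, w, c_in, c_out, k=3, bytes_per=1):
    """Off-chip traffic for one fused block when all intermediate feature
    maps stay on-chip: inputs + weights in at the start, outputs out at
    the end (hypothetical shapes, 8-bit data)."""
    load = (h * w * c_in + k * k * c_in * c_out) * bytes_per  # start of block
    store = h * w * c_out * bytes_per                          # end of block
    return load + store

# With c_out == c_in (as in VarGNet's normal block), feature-map traffic
# in and out of the block is balanced.
print(offchip_traffic_bytes(28, 28, 64, 64))
```

Keeping `c_out == c_in` keeps the stored feature map the same size as the loaded one, so no block becomes a bandwidth outlier.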
“…The authors also employ a data quantization strategy that is applied dynamically across layers and takes place during the training phase. An extension of this work is presented in [24], where the authors propose an end-to-end compiler that integrates optimizers for graphs, loops and data layouts. The main optimization fuses graph parts, operations, and layers, including operations across different kernels, and explores effective fusion strategies.…”
Section: Related Work
Mentioning confidence: 99%
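The fusion idea referenced above can be illustrated on the simplest case: a 1×1 convolution immediately followed by max pooling, computed in one pass so the full convolution output is never materialized. This is a minimal sketch of the concept, not DNNVM's actual fusion machinery; the function and shapes are hypothetical.

```python
import numpy as np

def conv1x1_then_pool_fused(x, w, pool=2):
    """Fused 1x1 convolution + max pooling: each pooled output element is
    accumulated directly, so the intermediate conv feature map is never
    stored in full (illustrative sketch of layer fusion)."""
    h, w_dim, _ = x.shape
    c_out = w.shape[1]
    out = np.full((h // pool, w_dim // pool, c_out), -np.inf)
    for i in range(h):
        for j in range(w_dim):
            y = x[i, j] @ w  # 1x1 conv at this pixel: mix input channels
            out[i // pool, j // pool] = np.maximum(out[i // pool, j // pool], y)
    return out

x = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
w = np.eye(3)  # identity weights, so pooling acts on the raw pixels
print(conv1x1_then_pool_fused(x, w))
```

Because pooling consumes each conv output as soon as it is produced, the fused kernel needs only the small running-maximum buffer instead of the whole intermediate layer, which is the memory saving fusion targets.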
“…This framework allows the user to customize the design of an equivalent CNN, and generates both synthesizable C++ code and ready-to-use scripts for Xilinx Vivado. In [24], the authors propose an end-to-end compiler that integrates optimizers for graphs, loops and data layouts and aims at generating smarter instructions. The authors in [29] propose a unified mathematical representation for efficient FPGA acceleration of all layers in CNN/DNN models and a framework that finds the optimal mapping of this representation to a specialized accelerator based on the roofline model.…”
Section: Related Work
Mentioning confidence: 99%
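The roofline model mentioned in the last statement caps attainable performance at the lower of the compute peak and memory bandwidth times operational intensity. A one-line sketch, with hypothetical accelerator figures:

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, arithmetic_intensity):
    """Roofline model: performance is bounded by min(compute peak,
    memory bandwidth * operational intensity in FLOPs per byte)."""
    return min(peak_gflops, bandwidth_gb_s * arithmetic_intensity)

# Hypothetical FPGA accelerator: 200 GFLOP/s peak, 10 GB/s DDR bandwidth.
for ai in (1, 10, 50):
    print(ai, attainable_gflops(200, 10, ai))
```

Kernels left of the ridge point (here, intensity below 20 FLOPs/byte) are memory-bound, which is why mapping frameworks like [29] search for layer tilings that raise operational intensity before adding compute units.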