2016
DOI: 10.1002/cpe.3850
FPGA‐accelerated deep convolutional neural networks for high throughput and energy efficiency

Abstract: Recent breakthroughs in deep convolutional neural networks (CNNs) have led to great improvements in the accuracy of both vision and auditory systems. Characterized by their deep structures and large numbers of parameters, deep CNNs challenge today's computational capabilities. Hardware specialization in the form of field-programmable gate arrays (FPGAs) offers a promising path toward major leaps in computational performance while achieving high energy efficiency. In this paper, we focus on accelerating d…

Cited by 45 publications (23 citation statements)
References 20 publications
“…Focusing on visual task-oriented proposals, those based on FPGAs stand out in terms of energy efficiency but not in performance [19], [20]. Additionally, some of them rely on other elements such as CPUs or external DRAM memory [21]–[24], the networks used as benchmarks are not always representative [25], [26], or their price limits the application range [27], [28].…”
Section: Hardware Implementation
confidence: 99%
“…However, the solution introduces a large overhead associated with the memory accesses and execution time needed to rearrange the input maps. This overhead was partially eliminated in [15] using an accelerator for matrix multiplication and dedicated units to convert the input maps into a matrix.…”
Section: Related Work
confidence: 99%
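To make the rearrangement step referenced above concrete, here is a minimal NumPy sketch of converting input maps into a matrix (commonly called im2col) so that a 2D convolution becomes a single matrix multiplication. The function names and shapes are illustrative assumptions, assuming a square kernel, stride 1, and no padding; this is not the interface of the accelerator in [15].

```python
import numpy as np

def im2col(x, k):
    """Unroll k-by-k patches of a 2D input map into matrix columns.

    x : (H, W) input feature map
    k : kernel size (assumed square, stride 1, no padding)
    Returns a (k*k, out_h*out_w) matrix of flattened patches.
    """
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output position contributes one flattened patch;
            # this copying is the rearrangement overhead in question.
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_as_matmul(x, w):
    """2D convolution (cross-correlation) via matrix multiplication."""
    k = w.shape[0]
    cols = im2col(x, k)       # memory-access overhead lives here
    out = w.ravel() @ cols    # one dense matrix-vector product
    return out.reshape(x.shape[0] - k + 1, -1)

# Sanity check against a direct sliding-window computation.
x = np.random.rand(6, 6)
w = np.random.rand(3, 3)
ref = np.array([[np.sum(x[i:i+3, j:j+3] * w) for j in range(4)]
                for i in range(4)])
assert np.allclose(conv2d_as_matmul(x, w), ref)
```

Note how `im2col` copies every k-by-k patch: that duplication is exactly the memory-access and execution-time overhead the citing authors describe, which the dedicated conversion units in [15] aim to hide.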
“…Due to the high computational complexity of the convolutional layer, prior work has addressed parallelism of the computation by unrolling the 2D convolution to matrix multiplication [12] or reducing the number of operations using the Fast Fourier Transform [10]. However, parallelization by unrolling encounters a bottleneck due to the limited on-chip memory of FPGAs.…”
Section: Introduction
confidence: 99%
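For contrast with unrolling, the sketch below illustrates the FFT route mentioned alongside [10]: by the convolution theorem, zero-padded transforms turn the sliding-window sum into pointwise products in the frequency domain, replacing the O(n²k²) direct cost with O(n² log n) transforms. This is a generic illustration, not code from the cited work.

```python
import numpy as np

def conv2d_fft(x, w):
    """Full linear 2D convolution via the convolution theorem.

    Zero-padding both operands to (H+k-1, W+k-1) turns the FFT's
    circular convolution into an ordinary linear convolution.
    """
    H, W = x.shape
    k = w.shape[0]
    shape = (H + k - 1, W + k - 1)
    X = np.fft.rfft2(x, shape)
    Wf = np.fft.rfft2(w, shape)
    # Pointwise multiply in the frequency domain, then invert.
    return np.fft.irfft2(X * Wf, shape)

# Cross-check against a direct full convolution (flipped kernel).
x = np.random.rand(8, 8)
w = np.random.rand(3, 3)
xp = np.pad(x, 2)  # pad by k-1 zeros on every side
ref = np.array([[np.sum(xp[i:i+3, j:j+3] * w[::-1, ::-1])
                 for j in range(10)] for i in range(10)])
assert np.allclose(conv2d_fft(x, w), ref)
```

The transform cost pays off best for large kernels or when kernel transforms are reused across many inputs, which is why FFT-based approaches target operation-count reduction rather than the memory bottleneck that unrolling faces.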