2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
DOI: 10.1109/ccgrid.2015.114

A Deep Learning Prediction Process Accelerator Based FPGA

Abstract: Recently, machine learning has been widely used in applications and cloud services, and deep learning, as an emerging field of machine learning, shows excellent ability in solving complex learning problems. To give users a better experience, high-performance implementations of deep learning applications are very important. As a common means of accelerating algorithms, FPGAs offer high performance, low power consumption, small size, and other advantages. We therefore use an FPGA to design a deep learning accelerator; the acceler…

Cited by 46 publications (20 citation statements)
References 10 publications
“…In addition, because ML models require a high level of parallelism for efficient performance, throughput is a major issue. Memory throughput can be optimized by introducing pipelining [20].…”
Section: Challenges and Optimization Opportunities in Embedded Machine Learning
confidence: 99%
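The pipelining referred to here can be read as double buffering: prefetch the next block of input while the current block is being processed, so memory latency overlaps with compute. Below is a minimal Python sketch under that assumption; `load_tile` and `compute` are hypothetical stand-ins, not functions from the cited accelerator.

```python
# Sketch of pipelined (double-buffered) processing: while one tile is
# being computed on, the next tile is prefetched in the background.
from concurrent.futures import ThreadPoolExecutor

def load_tile(i):
    # Stand-in for a DMA transfer of input tile i into on-chip buffers.
    return [i] * 1024

def compute(tile):
    # Stand-in for the accelerator's processing of one tile.
    return sum(tile)

def pipelined(num_tiles):
    results = []
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(load_tile, 0)   # prime the pipeline
        for i in range(num_tiles):
            tile = pending.result()                 # wait for current tile
            if i + 1 < num_tiles:
                pending = prefetcher.submit(load_tile, i + 1)  # prefetch next
            results.append(compute(tile))           # overlaps with prefetch
    return results

print(pipelined(4))
```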
“…However, graphics processing units (GPUs), due to their high floating-point performance and thread-level parallelism, are more suitable for training deep learning models [13]. Extensive research is actively being carried out to develop suitable hardware acceleration units using FPGAs [20, 21, 22, 23, 24, 25, 26], GPUs, ASICs, and TPUs to create heterogeneous, and sometimes distributed, systems to meet the high computational demand of deep learning models. At both the algorithm and hardware levels, optimization techniques for classical machine learning and deep learning algorithms, such as pruning, quantization, reduced precision, and hardware acceleration, are being investigated.…”
Section: Introduction
confidence: 99%
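Of the techniques named in this excerpt, quantization is easy to illustrate concretely. The sketch below shows symmetric 8-bit weight quantization as a toy example; the function names and the specific scaling rule are assumptions for illustration, not the scheme of any cited accelerator.

```python
# Toy symmetric 8-bit quantization: map the largest weight magnitude
# to 127 and store weights as int8, keeping one float scale per tensor.
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```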
“…For P² multiplication operations, P² multiplication units are used to calculate them in parallel. A classic adder tree is generally used to calculate the sum of P² numbers [28]. The classic adder tree expands the number of addends from P² to 2^⌈log₂(P²)⌉ by zero-padding; the sum of every two addends is then passed on to the next stage as its input.…”
Section: Addition Unit
confidence: 99%
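The adder tree this excerpt describes can be sketched directly: zero-pad the P² addends up to the next power of two, then sum adjacent pairs stage by stage, for ⌈log₂(P²)⌉ stages in total; in hardware, every pair within a stage is handled by its own adder in parallel. A minimal Python sketch; `adder_tree_sum` is an illustrative name.

```python
import math

def adder_tree_sum(addends):
    vals = list(addends)
    n = 1 << math.ceil(math.log2(len(vals)))   # next power of two
    vals += [0] * (n - len(vals))              # zero-padding
    while len(vals) > 1:
        # One tree stage: sum adjacent pairs. In hardware all pairs in
        # a stage are computed in parallel by separate adders.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

# Example: P = 3, so P^2 = 9 addends; padded to 16, summed in 4 stages.
print(adder_tree_sum(range(1, 10)))  # -> 45
```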
“…layers with different functions, which requires suitable hardware to accelerate its inference process. Meanwhile, many emerging fields, such as intelligent robots, unmanned aerial vehicles, self-driving cars, and space probes, impose strict restrictions on the power, delay, and physical size of hardware accelerators, and traditional GPUs can hardly satisfy these requirements [7], [8]. To satisfy these strict requirements, the Field Programmable Gate Array (FPGA) has become a high-performance and flexible accelerator for CNN inference in many emerging fields [9]-[13].…”
Section: Introduction, A. Background
confidence: 99%