2020
DOI: 10.1109/access.2020.2988311
An Efficient Task Assignment Framework to Accelerate DPU-Based Convolutional Neural Network Inference on FPGAs

Abstract: The Field Programmable Gate Array (FPGA) has become an efficient accelerator for convolutional neural network (CNN) inference due to its high performance and flexibility. To further improve the performance of CNN inference on FPGAs, Xilinx released an Intellectual Property core (IP core) called the Deep Learning Processor Unit (DPU). Unlike previous FPGA-based hardware designs focusing on specific functions or CNNs, the DPU IP supports a broad set of basic deep-learning functions, and developers can take advantage…
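To make the DPU workflow concrete, below is a minimal sketch of how a compiled CNN could be dispatched to a DPU through the Vitis AI runtime (VART) Python API. The model file name, buffer dtype, and shapes are assumptions for illustration, not taken from the paper; a real deployment would first quantize and compile the network with the Vitis AI toolchain to produce the .xmodel consumed here.

```python
# Minimal sketch: run one CNN inference on a DPU via the VART Python API.
# "resnet50.xmodel" is a hypothetical compiled model; int8 buffers assume
# a quantized network.
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("resnet50.xmodel")
# DPU-executable work is exposed as subgraphs whose "device" attribute is "DPU".
dpu_subgraphs = [
    s for s in graph.get_root_subgraph().toposort_child_subgraph()
    if s.has_attr("device") and s.get_attr("device").upper() == "DPU"
]
runner = vart.Runner.create_runner(dpu_subgraphs[0], "run")

in_tensor = runner.get_input_tensors()[0]
out_tensor = runner.get_output_tensors()[0]
input_data = np.zeros(tuple(in_tensor.dims), dtype=np.int8)    # placeholder batch
output_data = np.zeros(tuple(out_tensor.dims), dtype=np.int8)

# Inference is asynchronous: submit the job, then wait for completion.
job_id = runner.execute_async([input_data], [output_data])
runner.wait(job_id)
```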

Cited by 39 publications (30 citation statements)
References 31 publications
“…The previously proposed approaches targeting heterogeneous hardware utilization for CNN inference are similar to our work in that they use multiple available resources (CPU and FPGAs in [2]; CPU, GPU, and FPGAs in [23]; and CPU and multiple DPUs in [24]); however, they can only be employed when executing multiple CNN inferences simultaneously, which limits their applicability. In contrast, our proposed technique can be applied to single CONV layer acceleration, which has wider applicability than [23], [2], and [24]. In addition, such coarse-grained task partitioning may not work well on resource-constrained edge devices, because it is very rare to execute a large batch of images together on such devices.…”
Section: Related Work
confidence: 99%
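To illustrate the coarse-grained partitioning this citation describes, here is a minimal sketch (an assumption, not the paper's actual scheduler) in which the CPU enqueues whole-network inference requests and one worker thread per DPU instance drains the queue; run_on_dpu and NUM_DPUS are hypothetical placeholders.

```python
# Sketch of coarse-grained task assignment: the CPU initializes and queues
# inference requests; each DPU instance is served by its own worker thread.
import queue
import threading

NUM_DPUS = 2  # assumption: two DPU cores instantiated on the FPGA
tasks = queue.Queue()

def run_on_dpu(dpu_id: int, request: int) -> None:
    """Hypothetical stand-in for submitting one whole CNN inference to a DPU."""
    print(f"DPU {dpu_id} served request {request}")

def dpu_worker(dpu_id: int) -> None:
    while True:
        request = tasks.get()
        if request is None:          # poison pill: shut this worker down
            break
        run_on_dpu(dpu_id, request)
        tasks.task_done()

workers = [threading.Thread(target=dpu_worker, args=(i,)) for i in range(NUM_DPUS)]
for w in workers:
    w.start()

for request in range(8):             # CPU enqueues the inference tasks
    tasks.put(request)
tasks.join()
for _ in workers:                     # one poison pill per worker
    tasks.put(None)
```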
“…In [24], a task assignment technique is proposed for multi-CNN acceleration, which utilizes multiple deep learning processing units (DPUs) for CNN inference while the CPU is responsible for task initialization.…”
Section: Related Work
confidence: 99%
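The contrast drawn above is with partitioning a single CONV layer. As a toy illustration (an assumption, not taken from either paper), a 1x1 convolution reduces to a matrix multiply, so splitting its output-channel dimension yields independent sub-tasks that different resources could compute and then concatenate:

```python
# Toy sketch of single-CONV-layer partitioning along the output-channel axis.
import numpy as np

C_in, C_out, HW = 64, 128, 56 * 56
x = np.random.rand(C_in, HW).astype(np.float32)     # feature map, flattened
w = np.random.rand(C_out, C_in).astype(np.float32)  # 1x1 conv filters

full = w @ x                                        # whole layer on one device
half_a = w[: C_out // 2] @ x                        # e.g. the DPU's share
half_b = w[C_out // 2 :] @ x                        # e.g. the CPU's share

# Concatenating the partial results reproduces the full layer output.
assert np.allclose(full, np.concatenate([half_a, half_b], axis=0))
```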
“…Recently, Xilinx has released the Deep Learning Processing Unit (DPU), a configurable computation engine for CNNs [27]. The parallelism that can be achieved in the DPU depends on the target device and application.…”
Section: Related Work
confidence: 99%
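As a back-of-the-envelope illustration of that device dependence, the sketch below computes peak throughput from the DPU's three parallelism dimensions. The B4096 figures follow the Xilinx DPU product guide, and the clock frequency is an assumed example value, not a measurement from the paper.

```python
# Peak ops/cycle = pixel parallelism * input-channel parallelism *
# output-channel parallelism * 2 (multiply + accumulate per MAC).
PIXEL_PARALLEL = 8            # B4096 configuration values per the DPU
INPUT_CHANNEL_PARALLEL = 16   # product guide; smaller devices use smaller
OUTPUT_CHANNEL_PARALLEL = 16  # configurations (B512 ... B4096)

ops_per_cycle = PIXEL_PARALLEL * INPUT_CHANNEL_PARALLEL * OUTPUT_CHANNEL_PARALLEL * 2
assert ops_per_cycle == 4096  # hence the name "B4096"

clock_mhz = 300  # hypothetical DPU clock for a ZU9EG-class device
peak_gops = ops_per_cycle * clock_mhz / 1000
print(f"peak throughput ~ {peak_gops:.0f} GOPS")  # ~1229 GOPS at 300 MHz
```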
“…It can accelerate convolution computation and achieve efficient object recognition, detection, and classification. The DPU computing core is built on a fully pipelined structure and integrates a large number of convolution operators, adders, and non-linear pooling/ReLU operators, and it supports quantization methods with different dynamic precisions [30].…”
Section: Simulation Environment
confidence: 99%
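To illustrate what quantization with different dynamic precisions means, here is a minimal sketch of dynamic fixed-point quantization, where every layer stores int8 values but chooses its own fraction length. This is an illustrative scheme under that assumption, not the DPU's exact implementation.

```python
# Dynamic fixed-point sketch: int8 storage, per-layer fraction length,
# so the radix point (and thus the precision/range trade-off) moves per layer.
import numpy as np

def quantize_dynamic_fixed(x: np.ndarray, frac_bits: int) -> np.ndarray:
    """Quantize to int8 with `frac_bits` fractional bits (scale = 2**-frac_bits)."""
    scaled = np.round(x * (1 << frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, frac_bits: int) -> np.ndarray:
    return q.astype(np.float32) / (1 << frac_bits)

weights = np.array([0.31, -0.07, 0.99, -1.20], dtype=np.float32)
for frac_bits in (4, 6):  # per-layer choice of precision
    q = quantize_dynamic_fixed(weights, frac_bits)
    err = np.abs(dequantize(q, frac_bits) - weights).max()
    print(f"frac_bits={frac_bits}: q={q.tolist()} max_err={err:.4f}")
```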