“…Nevertheless, running more complex CNN models, such as VGG-16, on Zynq-7000 platforms using the DPU IP remains a challenge. Even after quantizing the VGG-16 model with Xilinx DNNDK, the model size is 132 MB [20]. To handle the large amount of computation needed by VGG-16, the largest DPU core that can be realized on the ZCU102 board, the B4096, was used [20].…”
Section: Results (mentioning)
confidence: 99%
“…Even after quantizing the VGG-16 model with Xilinx DNNDK, the model size is 132 MB [20]. To handle the large amount of computation needed by VGG-16, the largest DPU core that can be realized on the ZCU102 board, the B4096, was used [20]. Compared to the resources available on Zynq-7000 platforms, this DPU core requires over 3.2× the available DSPs, 1.12× the LUTs, and 1.9× the BRAM.…”
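To make the over-subscription concrete, here is a minimal sketch that turns the quoted factors into absolute resource demands. The B4096 factors (3.2× DSP, 1.12× LUT, 1.9× BRAM) come from the excerpt above; the Zynq-7020 totals are nominal device figures assumed here purely for illustration and are not taken from the cited paper.

```python
# Rough feasibility check for the B4096 DPU core against a Zynq-7000-class part.
# Over-subscription factors are quoted above; the Zynq-7020 totals are assumed
# nominal figures used only to illustrate the scale of the shortfall.
ZYNQ_7020 = {"LUT": 53_200, "DSP": 220, "BRAM36": 140}       # assumed available resources
OVERSUBSCRIPTION = {"LUT": 1.12, "DSP": 3.2, "BRAM36": 1.9}  # factors from the excerpt

for res, available in ZYNQ_7020.items():
    required = OVERSUBSCRIPTION[res] * available  # implied B4096 demand
    print(f"{res}: needs ~{required:,.0f} vs {available:,} available "
          f"({OVERSUBSCRIPTION[res]:.2f}x over budget)")
```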
The use of a convolutional neural network (CNN) to analyze and classify electroencephalogram (EEG) signals for identifying epileptic seizures has recently attracted the interest of researchers. This success has come with an enormous increase in the computational complexity and memory requirements of CNNs. To boost the performance of CNN inference, several hardware accelerators have been proposed. The high performance and flexibility of the field-programmable gate array (FPGA) make it an efficient accelerator for CNNs. Nevertheless, deploying CNN models on resource-limited platforms poses significant challenges. To ease CNN implementation on such platforms, the research community has made available several tools and frameworks, along with different optimization techniques. In this paper, we propose an FPGA implementation of an automatic seizure detection approach using two CNN models, namely VGG-16 and ResNet-50. To reduce the model size and computation cost, we exploit two optimization approaches: pruning and quantization. Furthermore, we present the results and discuss the advantages and limitations of two implementation alternatives for the inference acceleration of quantized CNNs on the Zynq-7000: a software implementation on the ARM (advanced RISC machine) processor based on the Arm NN software development kit (SDK), and a software/hardware implementation based on the deep learning processor unit (DPU) accelerator and the DNNDK toolkit.
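The two optimizations named in this abstract, pruning and quantization, can be illustrated with a short sketch. This is not the authors' DNNDK/Arm NN flow; it only shows the generic techniques on a stock torchvision VGG-16, and the 50% pruning ratio, the int8 dtype, and the dummy input are illustrative assumptions.

```python
# Minimal sketch of magnitude pruning plus quantization on VGG-16 (illustrative only,
# not the seizure-detection pipeline or the Xilinx toolchain described above).
import torch
import torch.nn.utils.prune as prune
from torchvision.models import vgg16

model = vgg16(weights=None).eval()  # stand-in for the trained model

# 1) Magnitude pruning: zero out the 50% smallest-magnitude weights of each conv layer.
for module in model.features:
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

# 2) Quantization: dynamic int8 quantization of the large fully connected classifier
#    layers (PyTorch dynamic quantization targets nn.Linear, not conv layers).
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(1, 3, 224, 224))  # dummy input tensor
print(out.shape)
```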
“…The use of cloud computing is inconvenient when operation-critical apparatus must be monitored, owing to the low reliability and high latency of remote connections, which require sufficient bandwidth to guarantee real-time operation; general-purpose platforms using CPUs and GPUs have silicon sizes, prices, and energy costs that are incompatible with integration into the apparatus to be monitored [5]. Similar limitations affect dedicated processors, such as the Xilinx Deep Learning Processor Unit (DPU) core [20], introduced to accelerate CNN inference on FPGAs. Although it is a configurable soft-core engine supporting various basic DL features (convolution, max and average pooling, etc.…”
Section: Related Work (mentioning)
confidence: 99%
“…It is worth noting that if we lower the operating frequency to set the ODR to 1 kHz, like the alternatives in Table IV, the power consumption of our proposal, 107 mW, remains significantly lower. With reference to application processing units (APUs) built with FPGAs, which in recent years have become very attractive for setting up highly customizable platforms [39], an interesting solution is the Xilinx DPU [20] for implementing high-performance NNs, including GoogLeNet, ResNet, and MobileNet, on Xilinx Zynq SoC devices. The DPU IP provides several possible configurations regarding the DSP slices, LUTs, block RAM, UltraRAM, the number of DPU cores, the convolution architecture, etc., to meet various types of constraints.…”
Section: A. FPGA (mentioning)
confidence: 99%
“…The DPU IP provides several possible configurations regarding the DSP slices, LUTs, block RAM, UltraRAM, the number of DPU cores, the convolution architecture, etc., to meet various types of constraints. However, even with the smallest convolution architecture, B512, as shown in Table V, the FPGA resources used by the DPU core on the UltraScale+ ZCU102 are 36,458 LUTs, 41,744 FFs, 77.5 BRAMs, and 124 DSPs, with a power consumption of 5.718 W [20]; in addition, memory and a program running on the APU must be provided to handle interrupts, data transfers, and storage of input, temporary, and output data, resulting in significantly greater overall resources than our project. Fig.…”
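For scale, the B512 figures quoted above can be expressed as a fraction of the host device. The B512 usage comes from the excerpt; the ZCU102 (XCZU9EG) totals below are nominal device figures assumed here for illustration only.

```python
# Utilization of the smallest DPU configuration (B512) on the ZCU102, using the
# resource counts quoted above; device totals are assumed nominal figures.
B512_USAGE   = {"LUT": 36_458, "FF": 41_744, "BRAM36": 77.5, "DSP": 124}
ZCU102_TOTAL = {"LUT": 274_080, "FF": 548_160, "BRAM36": 912, "DSP": 2_520}  # assumed

for res, used in B512_USAGE.items():
    print(f"{res}: {used:>9,} / {ZCU102_TOTAL[res]:,} "
          f"({100 * used / ZCU102_TOTAL[res]:.1f}% of the device)")
```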
In this work, a new custom design of an anomaly detection and classification system is proposed. It is composed of a convolutional auto-encoder (AE) hardware design that performs anomaly detection and cooperates with a mixed HW/SW convolutional neural network (CNN) that classifies the detected anomalies. The AE features partial binarization: the weights are binarized while the activations of some selected layers remain non-binarized. This is necessary to meet the severe area and energy constraints that allow it to be integrated on the same die as the MEMS sensors for which it serves as a neural accelerator. The CNN shares its feature extraction module with the AE, while a SW classifier, triggered by the AE when a fault is detected, works asynchronously to it. The AE has been mapped on a Xilinx Artix-7 FPGA, featuring an output data rate (ODR) of 365 kHz and achieving a power dissipation of 333 μW/MHz. Logic synthesis has targeted TSMC CMOS 65 nm, 90 nm, and 130 nm standard cells. The best results achieved highlight a power consumption of 138 μW/MHz with an area occupation of 0.49 mm² when real-time operation is set. These results enable the integration of the complete neural accelerator in the CMOS circuitry that typically sits with the inertial MEMS on the same silicon die. Comparisons with related works suggest that the proposed system achieves state-of-the-art performance and accuracy.
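The "partial binarization" described in this abstract (binarized weights, full-precision activations for selected layers) can be sketched as follows. This is an illustrative approximation, not the paper's hardware-oriented scheme; the straight-through estimator, layer sizes, and input shape are assumptions made for the example.

```python
# Minimal sketch of a weight-binarized convolution with full-precision activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeWeightsSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)  # {-1, +1} weights used in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()  # STE: pass gradients only for |w| <= 1

class BinaryWeightConv2d(nn.Conv2d):
    """Conv layer with binarized weights; activations stay in full precision."""
    def forward(self, x):
        w_bin = BinarizeWeightsSTE.apply(self.weight)
        return F.conv2d(x, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = BinaryWeightConv2d(1, 8, kernel_size=3, padding=1)
y = layer(torch.randn(4, 1, 64, 64))  # e.g. windows of sensor data (shape is illustrative)
print(y.shape)  # torch.Size([4, 8, 64, 64])
```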