An Efficient Parallel Architecture for Convolutional Neural Networks Accelerator on FPGAs

Huang, Hongmin; Li, Xueming; Yadong, Qin; Hu, Xianghong; Xiong, Xiaoming

doi:10.1145/3546000.3546010

Cited by 2 publications

(1 citation statement)

References 14 publications


“…To enhance compatibility with the DPU and optimize detection efficiency, we introduce modifications to the original YOLOv5 network: all activation functions now utilize Sigmoid [12], and the pooling kernel of the SPP [13] structure is set to 3×3, 5×5, 7×7. Specific training settings include a single target category (Fire), 9 anchors [14] (10,13,16,30,33,23,30,61,62,45,59,119,116,90,156,198,373,326), and an input image resolution of 416×416. We employ mosaic data augmentation [15] during training, with 24 frozen training iterations and a total of 48 iterations.…”

Section: Training (mentioning)

confidence: 99%

Hardware-Accelerated YOLOv5 Based on MPSoC

Liu, Mao, Gao et al. 2024

J. Phys.: Conf. Ser.

This paper details the development of a hardware acceleration system for YOLOv5, focusing on flame detection as its primary application. The implementation leverages the APU and DPU functionalities integrated into the Zynq UltraScale+ MPSoC XCZU7EV core. The proposed solution addresses the challenge of achieving real-time target detection on mobile terminals, ensuring both real-time operation and ultralow power consumption of YOLOv5. Notably, our design approach facilitates the deployment of all target detection algorithms under TensorFlow for mobile devices. To optimize model efficiency, we employ saturated linear mapping quantization with calibration. This technique maps model weights, biases, and activations from 32-bit to 8-bit, incurring only a 1.64% accuracy loss. The data flow design is realized through efficient data exchange between DDR, APU, and DPU, utilizing the AXI4 bus architecture. Image pre-processing and post-processing tasks are executed on the APU, while neural network inference occurs on the DPU. Our accelerated system demonstrates compelling experimental results: maintaining a detection speed of 56 FPS, achieving an accuracy of 36.56% on the COCO2014 dataset, and exhibiting a total system power consumption of only 4.147 W. Furthermore, the energy efficiency is measured at 15.41 GOPS/W, surpassing the RTX A6000 graphics card by a factor of 55.
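
The abstract names saturated linear mapping quantization with calibration but does not describe the procedure. The following is a minimal Python sketch of the general idea, not the authors' implementation and not the API of any DPU toolchain. The function names, the symmetric signed 8-bit range, and the percentile-based choice of the saturation threshold are all assumptions made for illustration.

```python
import math


def calibrate_threshold(samples, percentile=99.9):
    """Pick a saturation threshold from calibration data.

    Assumption: the threshold is a nearest-rank percentile of the absolute
    values. The abstract does not say which calibration rule is used.
    """
    magnitudes = sorted(abs(x) for x in samples)
    if not magnitudes:
        raise ValueError("calibration set is empty")
    rank = max(1, math.ceil(percentile * len(magnitudes) / 100))
    return magnitudes[rank - 1]


def quantize(values, threshold, bits=8):
    """Map floats linearly to signed integers, saturating beyond the threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    scale = qmax / threshold
    # Values whose magnitude exceeds the threshold are clamped (saturated).
    quantized = [max(-qmax, min(qmax, round(x * scale))) for x in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    # Hypothetical calibration activations with one large outlier.
    calibration = [-1.0, 0.5, 0.25, 2.0, 100.0]
    threshold = calibrate_threshold(calibration, percentile=80)
    codes, scale = quantize([0.0, 0.5, -2.0, 100.0], threshold)
    print(threshold, codes, dequantize(codes, scale))
```

Saturating at a calibrated threshold rather than at the observed maximum spends the 8-bit range on the bulk of the distribution, at the cost of clipping rare outliers.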
