2020
DOI: 10.1109/access.2020.2995330
A Power-Efficient Optimizing Framework FPGA Accelerator Based on Winograd for YOLO

Abstract: Accelerating deep learning networks in edge computing based on power-efficient and highly parallel FPGA platforms is an important goal. Combined with deep learning theory, an accelerator design method based on the Winograd algorithm for the deep learning object detection model YOLO under the PYNQ architecture is proposed. A Zynq FPGA is used to build the hardware acceleration platform of a YOLO network. The Winograd algorithm is used to improve traditional convolution. In the FPGA, the numerous multiplication …
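The abstract's central idea is that Winograd minimal filtering trades multiplications (expensive in FPGA DSP slices) for cheap additions. As a hedged illustration only, here is a minimal NumPy sketch of the 1-D F(2,3) case, using the standard Lavin–Gray transform matrices; the paper itself targets a 2-D FPGA implementation, and the function name here is ours, not from the paper:

```python
import numpy as np

# Winograd F(2,3): computes 2 outputs of a 1-D convolution with a
# 3-tap filter using 4 elementwise multiplications instead of 6.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap filter -> 2 outputs."""
    U = G @ g      # transformed filter (4 values)
    V = BT @ d     # transformed input  (4 values)
    M = U * V      # the only 4 multiplications
    return AT @ M  # inverse transform -> 2 outputs

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 2.0, 3.0])
print(winograd_f23(d, g))                      # [14. 20.]
print(np.convolve(d, g[::-1], mode="valid"))   # direct result, same values
```

In a 2-D tiled variant (F(2x2, 3x3), as typically used for CNN layers), the same trick cuts the multiplication count per output tile from 36 to 16, which is the source of the DSP savings the abstract refers to.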

Cited by 38 publications (16 citation statements). References 36 publications (37 reference statements).
“…Later, the literature [31] proposed combining the Winograd algorithm with CNN sparsity to improve accelerator performance, but the model used in its evaluation was simple. Bao et al. [32] used a fixed-point quantization approach to reduce FPGA resource consumption and proposed a buffer pipeline approach to further improve accelerator efficiency while reducing resource and power overhead. Wang et al. [33] introduced a new unstructured sparse convolution algorithm using a lower-precision quantization method and an end-to-end design-space search for a dedicated sparse-convolution circuit architecture, which achieved high computational efficiency, but its performance-to-power ratio was relatively low.…”
Section: Background and Related Workmentioning
confidence: 99%
“…An extensive review of hardware acceleration methods from multiple points of view can be found in the survey works [12,13]. Some optimization methods replace the standard convolution algorithm altogether with faster algorithms such as the fast Fourier transform (FFT) [14,15] or Winograd [16,17]. Other methods based on transforming the convolution computation perform convolution as matrix multiplication [18].…”
Section: Related Workmentioning
confidence: 99%
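The "convolution as matrix multiplication" transformation mentioned in the citation above [18] is commonly known as im2col. As a minimal illustrative sketch (the function name `im2col_conv2d` is ours, and real accelerators would avoid materializing the patch matrix):

```python
import numpy as np

def im2col_conv2d(x, w):
    """Valid 2-D cross-correlation via im2col + one matrix multiply.
    x: (H, W) input, w: (kH, kW) kernel."""
    H, W = x.shape
    kH, kW = w.shape
    oH, oW = H - kH + 1, W - kW + 1
    # Unroll every kH x kW patch into one row of a (oH*oW, kH*kW) matrix,
    # so the whole convolution becomes a single matrix-vector product.
    cols = np.array([x[i:i + kH, j:j + kW].ravel()
                     for i in range(oH) for j in range(oW)])
    return (cols @ w.ravel()).reshape(oH, oW)

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
print(im2col_conv2d(x, w))   # 2x2 map of 3x3 patch sums
```

The appeal on hardware is that the reshaped problem maps onto a highly optimized matrix-multiply engine (a GEMM unit or systolic array), at the cost of duplicating overlapping input values in memory.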
“…Miranda et al. [18] achieved 30.8 mAP50 accuracy at 14 FPS with 8-bit precision, and 31.5 mAP50 accuracy at 7 FPS with 16-bit precision, on the COCO dataset. Bao et al. [19] also proposed a power-efficient YOLOv2 architecture with a pipelined network structure. The PS runs Ubuntu OS with PYNQ [25] on it.…”
Section: Related Workmentioning
confidence: 99%