A Reconfigurable Neural Network Processor With Tile-Grained Multicore Pipeline for Object Detection on FPGA
2021
DOI: 10.1109/tvlsi.2021.3109580

Cited by 6 publications (5 citation statements). References 29 publications.
“…[Figure from the citing paper: a programmable filter with configuration inputs CNFG[2]–CNFG[15], data inputs In[0]–In[3], and output Out.]”
Section: Fig
Mentioning confidence: 99%
“…In particular, it has been demonstrated in [3] that it is possible to achieve a complete design flow for mapping CNNs onto FPGAs. FPGA reconfigurability provides the advantage of adapting a design to the CNN inference model without requiring significant modification of the hardware architecture [14]. Moreover, FPGAs achieve extensive computational parallelism, enabling the use of depthwise-separable convolution instead of standard convolution, which reduces the number of Multiply-and-Accumulate (MAC) modules required [15].…”
Section: Related Work
Mentioning confidence: 99%
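
To make the MAC saving mentioned in the statement above concrete, here is a minimal sketch comparing the multiply-accumulate counts of a standard convolution and its depthwise-separable replacement. The layer dimensions are assumptions for illustration, not values from the cited works.

```python
# Compare MAC counts: standard vs. depthwise-separable convolution.
# All layer dimensions below are assumed for illustration.

def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a k x k standard convolution over an h x w output map."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise k x k conv plus a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

h, w, c_in, c_out, k = 56, 56, 128, 128, 3  # assumed layer shape
std = standard_conv_macs(h, w, c_in, c_out, k)
sep = depthwise_separable_macs(h, w, c_in, c_out, k)
print(f"standard: {std:.2e} MACs, separable: {sep:.2e} MACs, "
      f"reduction: {std / sep:.1f}x")
```

For a 3 × 3 kernel and 128 output channels this works out to roughly an 8× reduction in MAC operations, which is why depthwise-separable layers map well onto MAC-limited FPGA fabrics.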
“…Gong et al. [35] proposed an accelerator architecture that exploits both static and dynamic reconfigurability of the hardware. Chang et al. [36] proposed a reconfigurable CNN processor with a parallelism-complementary, hierarchical multicore pipelining architecture. With well-designed architectures and data paths, these works achieved throughput above 1000 GOPS, but they did not focus on reducing latency.…”
Section: CNN FPGA Implementation
Mentioning confidence: 99%
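
The trade-off noted above (high throughput without lower latency) falls directly out of a simple pipeline model. The sketch below uses an assumed stage count and per-stage time, not measurements from [35] or [36].

```python
# Rough model of why pipelining lifts throughput more than it cuts latency.
# Stage count and per-stage time are assumed numbers, not measured values.

stages = 4            # assumed pipeline depth (one core per stage)
stage_time_ms = 5.0   # assumed time each stage spends on one frame

latency_ms = stages * stage_time_ms      # first frame traverses every stage
throughput_fps = 1000.0 / stage_time_ms  # steady state: one frame per stage time

print(f"per-frame latency: {latency_ms:.1f} ms")
print(f"throughput:        {throughput_fps:.1f} frames/s")
# Splitting the same work across more stages shrinks stage_time_ms and
# raises throughput, but the end-to-end latency of a single frame stays
# roughly constant, matching the observation quoted above.
```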
“…Other works [14]–[16] used on-chip memory to reduce off-chip memory accesses. Pipelined architectures [27], [28] have been explored to speed up inference time. Although these techniques can improve accelerator performance, the designs still rely on double data rate (DDR) DRAM for off-chip memory.…”
Section: Related Work
Mentioning confidence: 99%
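
To illustrate why off-chip DDR access dominates when on-chip buffering is insufficient, the following back-of-the-envelope sketch estimates DDR traffic for a single convolution layer. The layer shape, word width, and re-fetch factor are assumptions for illustration, not figures from the cited works.

```python
# Estimate off-chip (DDR) traffic for one convolution layer with and
# without on-chip buffering. All parameters below are assumptions.

h, w, c_in, c_out, k = 56, 56, 128, 128, 3
bytes_per_word = 2  # assumed 16-bit fixed-point data

ifmap = h * w * c_in * bytes_per_word
ofmap = h * w * c_out * bytes_per_word
weights = k * k * c_in * c_out * bytes_per_word

# Without enough on-chip buffer, the input map is re-fetched once per
# output-channel tile; assume 8 tiles as an illustrative penalty.
refetch = 8
ddr_unbuffered = refetch * ifmap + ofmap + weights
ddr_buffered = ifmap + ofmap + weights  # each tensor crosses DDR once

print(f"unbuffered DDR traffic: {ddr_unbuffered / 1e6:5.1f} MB")
print(f"buffered DDR traffic:   {ddr_buffered / 1e6:5.1f} MB")
```

Under these assumptions, buffering feature maps on chip cuts DDR traffic by roughly 4×, which is why the cited designs pursue on-chip memory even while keeping DDR for bulk storage.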