FPGA-Based High-Throughput CNN Hardware Accelerator With High Computing Resource Utilization Ratio

Huang, Wenjin; Wu, Huangtao; Chen, Qingkun; Luo, Conghui; Zeng, Shihao; Li, Tianrui; Huang, Yihua

doi:10.1109/tnnls.2021.3055814

Cited by 45 publications

(23 citation statements)

References 28 publications

“…Extensive experiments demonstrate that the DL-CSNet shows clear superiority over most classical CS algorithms. At last, the DL-CSNet could be accelerated on hardware such as FPGA [19].…”

Section: Discussion (mentioning)

confidence: 99%

DL-CSNet: Dictionary Learning based Compressed Sensing Neural Network

Qiu, Zhang, Huang, et al. 2022

J. Phys.: Conf. Ser.

In this paper, we propose a novel neural network for Compressed Sensing (CS) applications: the Dictionary Learning based Compressed Sensing neural Network (DL-CSNet). It is fairly simple but highly effective, consisting of only three layers: 1) a DL layer for latent sparse feature extraction; 2) a smoothing layer using a Total Variation (TV)-like constraint; and 3) a CS acquisition layer for neural network training. In particular, the TV-like smoothing layer is a perfect complement to the sparsity-oriented DL layer to achieve smooth images. The trained DL-CSNet can learn the optimal dictionary matrix so that images can be reconstructed with high quality. Finally, extensive experiments on binary images, compared against most classical CS algorithms, show the superior performance of the proposed DL-CSNet.

“…The architecture they mentioned has been implemented on Artix-7 FPGA and attained a significant improvement in speed when compared to existing architecture working at 300 MHz. Huang et al proposed a novel composite hardware CNN accelerator architecture to solve the problem of the inefficient computing resource mapping mechanism and data supply [23]. They proposed a multi-CE architecture based on a row-level pipelined streaming strategy for convolution layers and a single-CE architecture based on a batch-based computing method for full-connection layers.…”

Section: Related Work (mentioning)

confidence: 99%

Research on the Lightweight Deployment Method of Integration of Training and Inference in Artificial Intelligence

Zheng 2022

Applied Sciences

In recent years, the continuous development of artificial intelligence has largely been driven by algorithms and computing power. This paper mainly discusses the training and inference methods of artificial intelligence from the perspective of computing power. To address the issue of computing power, it is necessary to consider performance, cost, power consumption, flexibility, and robustness comprehensively. At present, the training of artificial intelligence models is mostly based on GPU platforms. Although GPUs offer high computing performance, their power consumption and cost are relatively high. GPUs are therefore not a suitable implementation platform in application scenarios with strict power consumption and cost constraints. The emergence of high-performance heterogeneous architecture devices provides a new path for the integration of artificial intelligence training and inference. Typically, in Xilinx and Intel’s multi-core heterogeneous architectures, multiple high-performance processors and FPGAs are integrated into a single chip. Compared with the current approach of separate training and inference, heterogeneous architectures use a single chip to integrate AI training and inference. This provides a good balance between training and inference for different targets and further reduces the cost and power consumption of AI training and inference, so as to achieve the goal of lightweight computation and to improve the flexibility and robustness of the system. In this paper, based on the LeNet-5 network structure, we first introduce the process of network training using a multi-core CPU in Xilinx’s latest multi-core heterogeneous architecture device, MPSoC. Then, the method of converting the network model into a hardware logic implementation was studied, and the model parameters were transferred from the processing system of the device to the hardware accelerator structure, composed of programmable logic, through the on-chip AXI bus interface. Finally, the integrated implementation method was tested and verified on a Xilinx MPSoC. According to the test results, the recognition accuracy of this lightweight deployment scheme on the MNIST and CIFAR-10 datasets reached 99.5% and 75.4%, respectively, while the average processing time of a single frame was only 2.2 ms. In addition, the power consumption of the network within the SoC hardware accelerator is only 1.363 W at 100 MHz.

“…The performance bottleneck of the off-chip memory is the data transfer delay, which can slow the data supply. During the operation of a CNN, frequent readings of the parameters in the memory are required, and the mismatch between the rates of data reading and calculation can cause the computational module to fail to achieve the expected efficiency and affect the system performance [10]. The huge amount of computation also leads to the challenge of deploying algorithms in smart chips with limited computational resources and I/O ports.…”

Section: Introduction (mentioning)

confidence: 99%

A Hardware-Friendly High-Precision CNN Pruning Method and Its FPGA Implementation

Sui, Zhi, et al. 2023

Sensors

To address the problems of large storage requirements, computational pressure, untimely data supply from off-chip memory, and low computational efficiency during hardware deployment due to the large number of convolutional neural network (CNN) parameters, we developed an innovative hardware-friendly CNN pruning method called KRP, which prunes the convolutional kernel on a row scale. A new retraining method based on LR tracking was used to obtain a CNN model with both a high pruning rate and high accuracy. Furthermore, we designed a high-performance convolutional computation module on the FPGA platform to help deploy KRP pruning models. The results of comparative experiments on CNNs such as VGG and ResNet showed that KRP has higher accuracy than most pruning methods. At the same time, the KRP method, together with the GSNQ quantization method developed in our previous study, forms a high-precision hardware-friendly network compression framework that can achieve “lossless” CNN compression with a 27× reduction in network model storage. The results of the comparative experiments on the FPGA showed that the KRP pruning method not only requires much less storage space, but also helps to reduce the on-chip hardware resource consumption by more than half and effectively improves the parallelism of the model on FPGAs, making it strongly hardware-friendly. This study provides more ideas for the application of CNNs in the field of edge computing.
