2023
DOI: 10.1145/3570928

FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA

Abstract: With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for three reasons: 1) the different dimensions within same-type layers, 2) the different convolution types, especially transposed and dilated convolutions, and 3) the CNN's complex dataflow graph. Furthermore, significant overheads arise w…
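
To make the first of these efficiency losses concrete, the sketch below estimates how much of a fixed-size systolic array sits idle when a layer's dimensions do not tile it evenly. The tile and layer shapes are hypothetical illustrations, not values from the paper:

import math

def sa_utilization(layer_dims, sa_dims):
    # Fraction of the systolic array's PEs doing useful work when a
    # layer of shape layer_dims is tiled onto an array of shape sa_dims.
    util = 1.0
    for work, tile in zip(layer_dims, sa_dims):
        tiles = math.ceil(work / tile)   # tiles needed along this dimension
        util *= work / (tiles * tile)    # padding in the last tile is idle
    return util

# A hypothetical 16x16 array fits 48 output channels exactly (3 full
# tiles), but a 40-channel layer leaves a third of the last tile idle.
print(sa_utilization((48, 64), (16, 16)))  # 1.0
print(sa_utilization((40, 64), (16, 16)))  # ~0.833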

Cited by 11 publications (3 citation statements)
References: 38 publications
“…Another accelerator in a different study [16] operated at a frequency of 1 GHz, consuming 512 multipliers, with a theoretical throughput of 498.6 GOPS, but the actual throughput only accounted for 48.69% of the theoretical value. Similar phenomena have also been reflected in other research [17–20]. In essence, low actual throughput reflects a low utilization efficiency of the multiplier.…”

Section: Introduction (supporting)
Confidence: 83%
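
The percentage in this quote can be reproduced if one assumes each multiplier performs one multiply-accumulate (two operations) per cycle, in which case 498.6 GOPS is the achieved figure against a 1024 GOPS peak; this reading is our assumption, not something stated in the quoted text:

\[ P_{\text{peak}} = 2 \times 512 \times 1\,\text{GHz} = 1024\ \text{GOPS}, \qquad \frac{498.6\ \text{GOPS}}{1024\ \text{GOPS}} \approx 48.69\% \]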
“…Compared to Angel-eye [47], we use similar LUT resources and achieve similar performance, but our DSP usage is significantly reduced and the overall computational resource efficiency is improved by 8.51%. While we may not possess a performance advantage compared to Caffeine [48] and FlexCNN [49], our work uses far fewer resources. In fact, we demonstrate a resource efficiency improvement of 15.16% and 19.80% compared to Caffeine [48] and FlexCNN [49], respectively. Furthermore, given that Xilinx's Vitis AI tool employs 8-bit quantization, the Xilinx B4096 DPU [34,50] exhibits reduced LUT resource consumption.…”

Section: Comparison With Related Work (mentioning)
Confidence: 97%
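
The quoted comparison does not define "computational resource efficiency". A common convention in FPGA accelerator papers, assumed here purely for illustration, is achieved throughput normalized by the dominant compute resource, e.g.

\[ \eta = \frac{P_{\text{achieved}}\ [\text{GOPS}]}{N_{\text{DSP}}} \]

so a 19.80% improvement over FlexCNN would mean this ratio is 1.198 times the corresponding FlexCNN figure.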