Convolutional neural networks (CNNs) have been widely employed for image recognition because they can achieve high accuracy by emulating the behavior of optic nerves in living creatures. Recently, the rapid growth of modern applications based on deep learning algorithms has further accelerated research and implementations. In particular, various accelerators for deep CNNs have been proposed on FPGA platforms, which offer high performance, reconfigurability, and fast development cycles. Although current FPGA accelerators have demonstrated better performance than generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not match well the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve the best performance due to underutilization of either logic resources or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. To overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with the best performance and the lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, which outperforms previous approaches significantly.
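The core of the roofline-based design scheme is that a design point's attainable performance is bounded either by its computational roof or by the memory bandwidth times its compute-to-communication (CTC) ratio. The sketch below illustrates this pruning idea; the function names, candidate design points, and bandwidth figure are illustrative assumptions, not values from the paper.

```python
# Sketch of roofline-based design-space pruning (illustrative, not the
# paper's actual tool or numbers).

def attainable_gflops(peak_gflops, bandwidth_gbs, ctc_ratio):
    """Attainable performance is capped by the compute roof or by
    memory traffic (bandwidth * FLOPs per byte of off-chip access)."""
    return min(peak_gflops, bandwidth_gbs * ctc_ratio)

# Each candidate is a (compute roof in GFLOPS, CTC ratio in FLOP/byte)
# pair produced by a particular loop tiling/transformation choice.
candidates = [
    (50.0, 1.5),
    (80.0, 0.8),
    (61.62, 2.2),
]
BANDWIDTH = 12.8  # GB/s, an assumed platform bandwidth

# Keep the design point with the highest attainable performance.
best = max(candidates,
           key=lambda c: attainable_gflops(c[0], BANDWIDTH, c[1]))
```

Note how the second candidate has the highest compute roof but loses because its low CTC ratio makes it bandwidth-bound; this is exactly the throughput/bandwidth mismatch the abstract describes.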
Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world applications. Since SpGEMM is a well-known memory-bound application with vast and irregular memory accesses, memory access efficiency is of critical importance when optimizing SpGEMM. Yet existing methods pay little attention to the memory subsystem and achieve suboptimal performance. In this paper, we thoroughly analyze the memory access patterns of SpGEMM and their influence on the memory subsystem. Based on this analysis, we propose a novel and more efficient accumulation method named BRMerge for multi-core CPU architectures. The BRMerge accumulation method follows a row-wise dataflow. It first accesses the B matrix, generates the intermediate lists for one output row, and stores these intermediate lists in a consecutive memory space implemented as a ping-pong buffer. It then immediately merges the intermediate lists generated in the previous phase two by two in a tree-like hierarchy between the two ping-pong buffers. The architectural benefits of BRMerge are 1) streaming access patterns, 2) minimized TLB cache misses, and 3) reasonably high L1/L2 cache hit rates, which together yield both low access latency and high bandwidth utilization when performing SpGEMM. Based on the BRMerge accumulation method, we propose two SpGEMM libraries, BRMerge-Upper and BRMerge-Precise, which use different allocation methods. Performance evaluations with 26 commonly used benchmarks on two CPU servers show that the proposed SpGEMM libraries significantly outperform the state-of-the-art SpGEMM libraries.
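The pairwise, tree-like accumulation at the heart of BRMerge can be sketched as follows: for one output row, the intermediate lists (sorted `(column, value)` pairs drawn from rows of B) are merged two by two until a single list remains. This is a minimal illustration of the merging hierarchy, not the paper's optimized ping-pong-buffer implementation; the function names are assumptions.

```python
# Illustrative sketch of tree-like pairwise accumulation of sorted
# (column, value) intermediate lists for one SpGEMM output row.

def merge_two(a, b):
    """Merge two sorted (col, val) lists, summing values on equal columns."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        elif b[j][0] < a[i][0]:
            out.append(b[j]); j += 1
        else:  # same column index: accumulate
            out.append((a[i][0], a[i][1] + b[j][1])); i += 1; j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def accumulate_row(lists):
    """Merge intermediate lists two by two in a tree-like hierarchy,
    alternating between two buffers (the ping-pong idea)."""
    while len(lists) > 1:
        nxt = []
        for k in range(0, len(lists), 2):
            if k + 1 < len(lists):
                nxt.append(merge_two(lists[k], lists[k + 1]))
            else:
                nxt.append(lists[k])  # odd one out carries over
        lists = nxt
    return lists[0] if lists else []
```

Because each round reads one buffer sequentially and writes the other, the access pattern stays streaming, which is the property the abstract credits for the low TLB miss rate and high bandwidth utilization.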
Graph neural networks (GNNs) are a promising emerging application for link prediction, recommendation, etc. Existing hardware innovation is limited to single-machine GNNs (SM-GNNs); however, enterprises usually adopt huge graphs with large-scale distributed GNNs (LSD-GNNs) that must be carried out with distributed in-memory storage. LSD-GNNs differ greatly from SM-GNNs in system architecture demands, workflow and operators, and hence characterizations. In this paper, we first quantitatively characterize LSD-GNNs with an industrial-grade framework and application, and find that the challenges lie in graph sampling: distributed graph access, long latency, and underutilized communication and memory bandwidth. These challenges are missing from previous SM-GNN-targeted research. We then propose a customized hardware architecture to address these challenges, including a fully pipelined access engine architecture for graph access and sampling, low-latency and bandwidth-efficient customized memory-over-fabric hardware, and a RISC-V-centric control system that provides good programmability. We implement the proposed architecture with full software support in a 4-card FPGA heterogeneous proof-of-concept (PoC) system. Based on measurement results from the FPGA PoC, we demonstrate that a single FPGA can provide the sampling capability of up to 894 vCPUs. With the goal of being profitable, programmable, and scalable, we further integrate the architecture into an FPGA cloud (FaaS) at hyperscale, along with the industrial software framework. We explicitly explore eight FaaS architectures that carry out the proposed accelerator hardware. We conclude that off-the-shelf FaaS.base can already provide a 2.47× performance-per-dollar improvement with our hardware. With architecture optimizations, FaaS.comm-opt with customized FPGA fabrics pushes the benefit to 7.78×, and FaaS.mem-opt with FPGA local DRAM and high-speed links to GPU further unleashes the benefit to 12.58×.
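To make concrete what the paper's access engine accelerates, the sketch below shows the basic multi-hop neighbor sampling step common to GNN training. This single-machine version is purely illustrative (the adjacency dictionary, fanouts, and function names are assumptions): in an LSD-GNN, each `adj` lookup becomes a remote access into distributed in-memory storage, which is the source of the long latency and bandwidth underutilization the paper characterizes.

```python
import random

# Minimal sketch of multi-hop neighbor sampling (illustrative only).

def sample_neighbors(adj, seeds, fanouts, rng=random.Random(0)):
    """For each hop, sample up to `fanout` neighbors of the current
    frontier; returns one sampled layer per hop."""
    frontier, layers = list(seeds), []
    for fanout in fanouts:
        nxt = []
        for v in frontier:
            nbrs = adj.get(v, [])
            # In a distributed setting, this lookup is a network round
            # trip; pipelining many such requests hides its latency.
            nxt.extend(rng.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(nxt)
        frontier = nxt
    return layers
```

Since each hop issues many small, independent, irregular reads, the operation is latency-bound rather than compute-bound, which is why a fully pipelined hardware access engine can outperform hundreds of vCPUs on this step.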