2022
DOI: 10.48550/arxiv.2203.00158
Preprint
GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks

Abstract: Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that their two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a series of sparse-dense matrix multiplications. However, prior work frequently suffers from inefficient data movement…
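The abstract describes casting GCN aggregation as a sparse-dense GEMM, and the citation statements below note that GROW organizes this around Gustavson's algorithm (a row-wise product). As a rough illustration only, here is a minimal software sketch of that row-wise dataflow on a CSR matrix; the function name and CSR layout are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch (not from the paper): Gustavson-style row-wise-product
# sparse-dense GEMM, the dataflow GROW's accelerator is organized around.
import numpy as np

def spmm_row_product(indptr, indices, data, B):
    """Multiply a CSR sparse matrix A by a dense matrix B, one row of A at a time.

    For each nonzero A[i, k], row k of B is scaled by A[i, k] and accumulated
    into output row i. The output row stays resident ("row-stationary") while
    the needed rows of B stream in, which is the memory-efficiency argument
    behind row-wise products for sparse-dense GEMM.
    """
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            C[i] += data[p] * B[indices[p]]
    return C
```

In a GCN, A would be the (sparse) normalized adjacency matrix and B the dense node-feature matrix, so one such SpMM implements the aggregation stage.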

Cited by 1 publication (2 citation statements)
References 32 publications
“…They also propose FlowGNN [43], which can flexibly support the majority of message-passing GNNs. [11] proposes a GCN accelerator named GROW, which uses Gustavson's algorithm to architect a sparse-dense GEMM accelerator with a row-wise product. [44] proposes MultiGCN, which balances network latency and network bandwidth for large-scale GCNs in multi-node acceleration systems.…”
Section: Related Work
confidence: 99%
“…Hence, accelerating GNN inference using reconfigurable accelerators such as FPGAs is essential at the LHC, since it would enable sophisticated processing to run in real time on the data stream from detectors with superior accuracy. Many existing FPGA-based GNN accelerators are designed with a single-engine architecture that processes layers or sub-layers (blocks) repeatedly, as GPUs do, so the networks are executed in a recurrent fashion [6,7,8,9,10,11]. However, this is not efficient for GNN execution when targeting small graphs with requirements of ultra-low latency and high throughput for scientific applications, e.g., particle identification.…”
Section: Introduction
confidence: 99%