Proceedings of the 49th Annual International Symposium on Computer Architecture 2022
DOI: 10.1145/3470496.3527439

Hyperscale FPGA-as-a-service architecture for large-scale distributed graph neural network

Abstract: Graph neural network (GNN) is a promising emerging application for link prediction, recommendation, etc. Existing hardware innovation is limited to single-machine GNN (SM-GNN); however, enterprises usually adopt huge graphs with large-scale distributed GNN (LSD-GNN), which has to be carried out with distributed in-memory storage. LSD-GNN is very different from SM-GNN in terms of system architecture demands, workflow and operators, and hence characterizations. In this paper, we first quantitatively characterize…

Cited by 14 publications (6 citation statements); references 52 publications (48 reference statements).
“…Static GNN. Over the last few years, there have been substantial research achievements for static GNN acceleration on GPUs, covering general runtime frameworks [7,26,47,52], SpMM-like aggregation optimization [9,15,16,48], and the scaling of distributed training [18,23,41,43,45,46,49].…”
Section: GNN Acceleration
Mentioning confidence: 99%
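To make the "SpMM-like aggregation" mentioned in the statement above concrete, the sketch below expresses one GNN aggregation step as a sparse (CSR) adjacency matrix multiplied by a dense feature matrix. The shapes, density, and variable names are illustrative assumptions, not values from the paper.

# Minimal sketch of SpMM-style GNN aggregation (illustrative values only):
# neighbor features are combined by multiplying a sparse CSR adjacency
# matrix with a dense node-feature matrix in a single SpMM call.
import numpy as np
import scipy.sparse as sp

num_nodes, feat_dim = 5, 8
adj = sp.random(num_nodes, num_nodes, density=0.3, format="csr")  # sparse graph
features = np.random.rand(num_nodes, feat_dim)                    # dense features

# One aggregation step: row v of `aggregated` is the weighted sum of the
# feature vectors of v's neighbors.
aggregated = adj @ features
print(aggregated.shape)  # (5, 8)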
“…GNN models can also be trained full-graph; this approach does not require the sampling stage. However, full-graph training causes a large memory footprint [18], [19] that may not fit in device memory (e.g., FPGA local DDR). Therefore, HitGNN focuses on accelerating mini-batch GNN training, as it demonstrates advantages in accuracy and scalability on large graphs and has been adopted by many state-of-the-art GNN frameworks [6], [8], [15], [20].…”
Section: Mini-batch GNN Training
Mentioning confidence: 99%
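For context on the mini-batch training referred to above, the sketch below shows the core idea of neighborhood sampling: only a small L-hop subgraph around each seed node is materialized per step, which is what keeps the footprint within device memory. The adjacency list, fanouts, and function name are hypothetical, not taken from HitGNN or the paper.

import random

def sample_khop(adj_list, seeds, fanouts):
    # Sample up to `fanout` neighbors per frontier node, one hop per fanout entry.
    frontier, sampled = set(seeds), set(seeds)
    for fanout in fanouts:
        next_frontier = set()
        for v in frontier:
            neighbors = adj_list.get(v, [])
            next_frontier.update(random.sample(neighbors, min(fanout, len(neighbors))))
        sampled |= next_frontier
        frontier = next_frontier
    return sampled

# Toy graph: only this small sampled subgraph (not the full graph) is
# loaded for one training step.
adj_list = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 4], 4: [1, 3]}
print(sample_khop(adj_list, seeds=[0], fanouts=[2, 2]))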
“…This is because the computation characteristics of CNN and GNN are quite different: CNN models feature structured input data with high computation intensity, while GNN models … There are also works that accelerate GNN training using multiple FPGAs. [18] accelerates GNN training on a distributed platform, where the graph is stored on multiple nodes. On a distributed platform, the training performance is bottlenecked by the sampling stage.…”
Section: Related Work
Mentioning confidence: 99%
“…A lightweight open-source RISC-V core [54] is used for programmability and control. The access engine [55] is customized for low-latency sampling and supports out-of-order requests for latency hiding. For computation, …”
Section: Evaluation, A. Evaluation Setup, 1) System Configuration
Mentioning confidence: 99%
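As a software analogy of the out-of-order access engine described above (a sketch only; fetch_neighbors and all parameters are hypothetical), latency hiding amounts to keeping many small sampling requests in flight and consuming completions in whatever order they arrive:

from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

def fetch_neighbors(node_id):
    # Stand-in for a small remote/DDR structure lookup with variable latency.
    time.sleep(random.uniform(0.01, 0.05))
    return node_id, [node_id + 1, node_id + 2]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_neighbors, n) for n in range(16)]
    for fut in as_completed(futures):   # completions arrive out of order
        node, nbrs = fut.result()       # overlapping requests hides the latency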
“…1, the sampling phase gathers the graph structure and features from local and remote machines for the subsequent aggregation and combination phases. During graph sampling, nearly 48% of the memory accesses [55] are for graph structure (e.g., node ID and edge offset of the CSR-formatted adjacency matrix), and this kind of memory access is small in size (8-64 bytes) and discontinuous.…”
Section: Introduction
Mentioning confidence: 99%
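The access pattern described above can be illustrated with a plain CSR lookup (the arrays below are illustrative, not from the paper): fetching one node's neighbor list reads two 8-byte edge offsets plus a short run of neighbor IDs, and consecutive lookups land at unrelated addresses because sampled node IDs are essentially random.

import numpy as np

indptr = np.array([0, 2, 5, 6, 8], dtype=np.int64)             # edge offsets
indices = np.array([1, 3, 0, 2, 3, 1, 0, 1], dtype=np.int64)   # neighbor node IDs

def neighbors(node):
    start, end = indptr[node], indptr[node + 1]   # two 8-byte reads
    return indices[start:end]                     # a few more 8-byte reads

for v in np.random.permutation(4):                # scattered (discontinuous) node IDs
    print(int(v), neighbors(v))                   # each lookup touches only tens of bytes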