2022
DOI: 10.1109/tcad.2021.3079142
Rubik: A Hierarchical Architecture for Efficient Graph Neural Network Training

Cited by 30 publications (7 citation statements) · References 34 publications
“…The current taxonomy captures the intra-phase dataflows and the inter-phase dataflows. However, our taxonomy does not capture the order of nodes, graph partitioning and optimizations such as load balancing [20], computation elimination via memoizing [23], [36] and requires an extension to capture these.…”
Section: Discussion and Future Work
Confidence: 99%
“…An increasing amount of research deals with the co-design of software and hardware to accelerate GNN training [23,95,96,203]. Here, not only software and algorithms are optimized, but hardware modules are also developed to better address the characteristics of GNNs.…”
Section: Current Research Trends
Confidence: 99%
“…Some works, like HyGCN [69], AWB [25], and VersaGNN [60], target sparse matrix multiplication, but this is not the workflow for LSD-GNN. Others, such as GCNAX [44], BoostGCN [77], Rubik [14], GraphACT [76], GNNSampler [47], and Grip [39,40], which optimize data reuse, are not applicable to LSD-GNN either, since the chance of finding reuse within a 512-node mini-batch out of 10+ billion total nodes is extremely low. Huang et al. [34] work on a problem similar to this paper's, but in a different disaggregated-memory-pool context.…”
Section: Related Work
Confidence: 99%