Rubik: A Hierarchical Architecture for Efficient Graph Learning

Chen, Xiaobing; Wang, Yuke; Xie, Xiujuan; Hu, Xing; Basak, Abanti; Liang, Ling; Yan, Mingyu; Deng, Lei; Ding, Yufei; Du, Zidong; Chen, Tianshi; Xie, Yuan

doi:10.48550/arxiv.2009.12495

Cited by 5 publications (12 citation statements)

References 37 publications (34 reference statements)


“…2c, mean pool uses many ThCudaTensor_scatterAddKernels which are also present in the Aggregation phase. Again, similar to previous GNN accelerators [3], [4], [11], [21], [27], [39], [41]-[43], this work will focus only on the Aggregation and Combination phases, as the main kernels in aggregation and combination consume a majority of the GNN inference runtime.…”

Section: B Pytorch Geometric Characterization (mentioning)

confidence: 99%

“…Some sparsity centric optimizations for Aggregation phase such as workload balancing in AWB-GCN [11] and window shrinking in HyGCN [39] are not captured in the dataflows. Moreover, the description does not capture specific Aggregation optimizations in Rubik [4] where repeated partial sums are reused, thus redundant computations are eliminated [16].…”

Section: F Scope Of Taxonomy (mentioning)

confidence: 99%

“…The acceleration of GNN workloads is an active area of research that distinguishes between software and hardware acceleration [2]. On the one hand, software acceleration for GNNs aims at exploiting the knowledge of the graph properties to better adapt the workload to the underlying hardware [4], [10], [14], [16], [20], [28], [34]-[37], [45]. This includes techniques such as intelligent partitioning [34], sparsity-aware workload management [37], vertex reordering [4], or the caching of partial aggregations to avoid redundant sums [16].…”

Section: Related Work (mentioning)

confidence: 99%

“…On the one hand, software acceleration for GNNs aims at exploiting the knowledge of the graph properties to better adapt the workload to the underlying hardware [4], [10], [14], [16], [20], [28], [34]-[37], [45]. This includes techniques such as intelligent partitioning [34], sparsity-aware workload management [37], vertex reordering [4], or the caching of partial aggregations to avoid redundant sums [16]. These techniques are either specific for GPUs, such as the dataflow constructs in Neugraph [28], or orthogonal to the dataflow approach.…”

Section: Related Work (mentioning)

confidence: 99%

“…Computing GNN inference requires a mix of memory and compute intensive operations, which commodity CPUs, GPUs and traditional DNN accelerators do not exploit efficiently [27], [39], [40], [44]. This led to the development of many dedicated GNN accelerators, each with their own design methodology to extract as much performance as possible [3], [4], [11], [21], [27], [39], [41]-[43].…”

Section: Introduction (mentioning)

confidence: 99%


A Taxonomy for Classification and Comparison of Dataflows for GNN Accelerators

Garg, Qin, Muñoz-Martínez et al. 2021


Recently, Graph Neural Networks (GNNs) have received a lot of interest because of their success in learning representations from graph structured data. However, GNNs exhibit different compute and memory characteristics compared to traditional Deep Neural Networks (DNNs). Graph convolutions require feature aggregations from neighboring nodes (known as the aggregation phase), which leads to highly irregular data accesses. GNNs also have a very regular compute phase that can be broken down to matrix multiplications (known as the combination phase). All recently proposed GNN accelerators utilize different dataflows and microarchitecture optimizations for these two phases. Different communication strategies between the two phases have been also used. However, as more custom GNN accelerators are proposed, the harder it is to qualitatively classify them and quantitatively contrast them. In this work, we present a taxonomy to describe several diverse dataflows for running GNN inference on accelerators. This provides a structured way to describe and compare the design-space of GNN accelerators.
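The two phases named in this abstract can be illustrated with a minimal pure-Python sketch. It assumes a sum aggregator with a self loop and a ReLU after the dense product; the function names `aggregate` and `combine`, the toy graph, and the weight matrix are all hypothetical and are not taken from any of the cited accelerators.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def aggregate(adj: Dict[int, List[int]], feats: Dict[int, Vector]) -> Dict[int, Vector]:
    """Aggregation phase: add each neighbor's feature vector to the node's own.

    The neighbor list drives the memory accesses, so the access pattern is
    irregular and depends on the graph structure.
    """
    out: Dict[int, Vector] = {}
    for node, neighbors in adj.items():
        acc = list(feats[node])  # assumed self loop: start from the node's own features
        for nb in neighbors:
            for i, x in enumerate(feats[nb]):
                acc[i] += x
        out[node] = acc
    return out


def combine(agg: Dict[int, Vector], weight: Matrix) -> Dict[int, Vector]:
    """Combination phase: dense vector-matrix product followed by ReLU.

    Every node performs the same regular computation against a shared weight matrix.
    """
    out: Dict[int, Vector] = {}
    n_out = len(weight[0])
    for node, vec in agg.items():
        row = [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(n_out)]
        out[node] = [max(0.0, v) for v in row]
    return out


# Toy graph: node 0 is connected to nodes 1 and 2.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
weight = [[1.0, -1.0], [0.0, 1.0]]

agg = aggregate(adj, feats)
out = combine(agg, weight)
```

The split mirrors the contrast drawn in the abstract: `aggregate` follows graph edges, while `combine` is a plain matrix multiplication applied row by row.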


I-GCN: A Graph Convolutional Network Accelerator with Runtime Locality Enhancement through Islandization

Geng, Zhang et al. 2021

MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture


Graph Convolutional Networks (GCNs) have drawn tremendous attention in the past three years. Compared with other deep learning modalities, high-performance hardware acceleration of GCNs is as critical but even more challenging. The hurdles arise from the poor data locality and redundant computation due to the large size, high sparsity, and irregular non-zero distribution of real-world graphs. In this paper we propose a novel hardware accelerator for GCN inference, called I-GCN, that significantly improves data locality and reduces unnecessary computation. The mechanism is a new online graph restructuring algorithm we refer to as islandization. The proposed algorithm finds clusters of nodes with strong internal but weak external connections. The islandization process yields two major benefits. First, by processing islands rather than individual nodes, there is better on-chip data reuse and fewer off-chip memory accesses. Second, there is less redundant computation as aggregation for common/shared neighbors in an island can be reused. The parallel search, identification, and leverage of graph islands are all handled purely in hardware at runtime working in an incremental pipeline. This is done without any preprocessing of the graph data or adjustment of the GCN model structure. Experimental results show that I-GCN can significantly reduce off-chip accesses and prune 38% of aggregation operations, leading to performance speedups over CPUs, GPUs, the prior art GCN accelerators of 5549×, 403×, and 5.7× on average, respectively. CCS CONCEPTS: • Computer systems organization → Neural networks; Parallel architectures; • Computing methodologies → Parallel algorithms.
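The idea of reusing an aggregation over shared neighbors can be sketched in a few lines of Python. This is only an illustration of the arithmetic saving, not I-GCN's hardware islandization or Rubik's mechanism: the shared neighbor set is supplied by hand rather than discovered, features are scalars, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def aggregate_naive(adj: Dict[int, List[int]], feats: Dict[int, float]) -> Tuple[Dict[int, float], int]:
    """Sum neighbor features per node and count the scalar additions used."""
    sums: Dict[int, float] = {}
    additions = 0
    for node, nbrs in adj.items():
        sums[node] = sum(feats[n] for n in nbrs)
        additions += max(len(nbrs) - 1, 0)  # k values need k - 1 additions
    return sums, additions


def aggregate_with_reuse(
    adj: Dict[int, List[int]], feats: Dict[int, float], shared: Set[int]
) -> Tuple[Dict[int, float], int]:
    """Same result, but the partial sum over `shared` is computed only once."""
    shared_sum = sum(feats[n] for n in shared)
    additions = max(len(shared) - 1, 0)
    sums: Dict[int, float] = {}
    for node, nbrs in adj.items():
        nbr_set = set(nbrs)
        if shared and shared <= nbr_set:
            # Reuse the cached partial sum, then add the remaining neighbors.
            rest = nbr_set - shared
            sums[node] = shared_sum + sum(feats[n] for n in rest)
            additions += len(rest)
        else:
            sums[node] = sum(feats[n] for n in nbrs)
            additions += max(len(nbrs) - 1, 0)
    return sums, additions


# Nodes 0 and 1 both have neighbors 2 and 3 in common.
adj = {0: [2, 3, 4], 1: [2, 3, 5]}
feats = {2: 1.0, 3: 2.0, 4: 3.0, 5: 4.0}

naive_sums, naive_adds = aggregate_naive(adj, feats)
reuse_sums, reuse_adds = aggregate_with_reuse(adj, feats, shared={2, 3})
```

In this toy case both versions produce the same sums, while the reuse version performs three additions instead of four; the saving grows with the size of the shared set and the number of nodes that contain it.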


GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design

You, Geng, Zhang et al. 2021

Preprint


Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art graph learning model. However, it can be notoriously challenging to inference GCNs over large graph datasets, limiting their application to large real-world graphs and hindering the exploration of deeper and more sophisticated GCN graphs. This is because real-world graphs can be extremely large and sparse. Furthermore, the node degree of GCNs tends to follow the power-law distribution and therefore have highly irregular adjacency matrices, resulting in prohibitive inefficiencies in both data processing and movement and thus substantially limiting the achievable GCN acceleration efficiency. To this end, this paper proposes a GCN algorithm and accelerator Co-Design framework dubbed GCoD which can largely alleviate the aforementioned GCN irregularity and boost GCNs' inference efficiency. Specifically, on the algorithm level, GCoD integrates a split and conquer GCN training strategy that polarizes the graphs to be either denser or sparser in local neighborhoods without compromising the model accuracy, resulting in graph adjacency matrices that (mostly) have merely two levels of workload and enjoys largely enhanced regularity and thus ease of acceleration. On the hardware level, we further develop a dedicated two-pronged accelerator with a separated engine to process each of the aforementioned denser and sparser workloads, further boosting the overall utilization and acceleration efficiency. Extensive experiments and ablation studies validate that our GCoD consistently reduces the number of off-chip accesses, leading to speedups of 15286×, 294×, 7.8×, and 2.5× as compared to CPUs, GPUs, and prior-art GCN accelerators including HyGCN and AWB-GCN, respectively, while maintaining or even improving the task accuracy. Additionally, we visualize GCoD trained graph adjacency matrices for a better understanding of its advantages.
