Graph neural networks (GNNs) are a promising emerging application for link prediction, recommendation, and similar tasks. Existing hardware innovation is limited to single-machine GNN (SM-GNN); enterprises, however, typically adopt huge graphs and run large-scale distributed GNN (LSD-GNN), which must be carried out with distributed in-memory storage. LSD-GNN differs substantially from SM-GNN in system architecture demands, workflow, and operators, and hence in its characteristics. In this paper, we first quantitatively characterize LSD-GNN with an industrial-grade framework and application, and show that its challenges lie in graph sampling: distributed graph access, long latency, and underutilized communication and memory bandwidth. These challenges are missing from previous SM-GNN-targeted research. We then propose a customized hardware architecture to address them, including a fully pipelined access-engine architecture for graph access and sampling, low-latency and bandwidth-efficient customized memory-over-fabric hardware, and a RISC-V-centric control system that provides good programmability. We implement the proposed architecture, with full software support, in a 4-card FPGA heterogeneous proof-of-concept (PoC) system. Based on measurements from the FPGA PoC, we demonstrate that a single FPGA can provide the sampling capability of up to 894 vCPUs. With the goal of being profitable, programmable, and scalable, we further integrate the architecture into an FPGA cloud (FaaS) at hyperscale, along with the industrial software framework. We explicitly explore eight FaaS architectures that carry the proposed accelerator hardware. We conclude that off-the-shelf FaaS.base already provides a 2.47× performance-per-dollar improvement with our hardware. With architectural optimizations, FaaS.comm-opt with customized FPGA fabrics pushes the benefit to 7.78×, and FaaS.mem-opt with FPGA-local DRAM and high-speed links to the GPU further raises it to 12.58×.
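To make the sampling bottleneck concrete, below is a minimal sketch of multi-hop neighbor sampling over a partitioned graph, the operation this abstract identifies as the LSD-GNN challenge. All names here (`Partition`, `fetch_neighbors`, `sample_khop`) are hypothetical illustrations, not the paper's API; the point is that every hop issues fine-grained remote reads per frontier node, which is why a pipelined access engine and low-latency fabric pay off.

```python
# Sketch: fan-out neighbor sampling over a distributed (partitioned) graph.
# In LSD-GNN each fetch_neighbors() is a remote access over the fabric,
# so latency, not compute, dominates sampling time.
import random

class Partition:
    """One shard of the distributed graph, held on a remote memory node."""
    def __init__(self, adj):
        self.adj = adj  # node id -> list of neighbor ids

    def fetch_neighbors(self, node):
        return self.adj.get(node, [])

def sample_khop(seeds, partitions, owner, fanouts):
    """Sample a `fanouts`-shaped neighborhood around each seed node.

    owner(node) maps a node id to the partition holding its adjacency list.
    """
    frontier, layers = list(seeds), []
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            nbrs = partitions[owner(node)].fetch_neighbors(node)
            next_frontier.extend(random.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(next_frontier)
        frontier = next_frontier
    return layers

# Toy example: two partitions holding even/odd node ids.
parts = [Partition({0: [1, 2, 3], 2: [1, 5]}),
         Partition({1: [0, 2], 3: [0], 5: [2]})]
print(sample_khop([0], parts, owner=lambda n: n % 2, fanouts=[2, 2]))
```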
Graph neural networks (GNNs) have drawn tremendous attention in the past few years owing to their convincing performance and high interpretability on graph-based tasks such as link prediction and node classification. With ever-growing graph sizes in the real world, especially billion-scale industrial graphs, graph storage can easily consume terabytes, so GNN processing must be carried out in a distributed manner. As a result, execution can be inefficient due to expensive cross-node communication and irregular memory access. Various GNN accelerators have been proposed for efficient GNN processing; however, they mainly target small and medium-size graphs and are not applicable to large-scale distributed graphs. In this paper, we present a practical near-data-processing architecture based on a memory-pool system for large-scale distributed GNNs. We propose a customized memory-fabric interface to construct the memory pool for low-latency, high-throughput cross-node communication, providing flexible memory allocation and strong scalability. A practical near-data-processing design is proposed for efficient work offloading and improved bandwidth utilization. Moreover, we introduce a partition-and-scheduling scheme to further improve performance and achieve workload balance. Comprehensive evaluations demonstrate that the proposed architecture achieves up to 27× and 8× higher training speed than two state-of-the-art distributed GNN frameworks, Deep Graph Library and P3, respectively.
INDEX TERMS Graph neural network, large-scale graph processing, memory pool, near-data processing.
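As a rough illustration of the kind of workload balance the partition-and-scheduling scheme targets, the sketch below assigns vertices to memory nodes greedily by degree so that per-node sampling work is even. This is an assumed, generic longest-processing-time heuristic for illustration only, not the paper's actual algorithm.

```python
# Sketch: degree-aware placement of vertices onto memory-pool nodes so
# that each node carries roughly equal neighbor-access work.
import heapq

def balance_by_degree(degrees, num_nodes):
    """degrees: {vertex: degree}. Returns {vertex: node_id}.

    Greedy LPT assignment: heaviest vertices first, each placed on the
    currently least-loaded memory node.
    """
    heap = [(0, node) for node in range(num_nodes)]  # (load, node_id)
    heapq.heapify(heap)
    placement = {}
    for v, deg in sorted(degrees.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        placement[v] = node
        heapq.heappush(heap, (load + deg, node))
    return placement

# Five vertices of skewed degree onto two nodes; loads end at 105 vs 105.
print(balance_by_degree({0: 100, 1: 90, 2: 10, 3: 5, 4: 5}, num_nodes=2))
```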
Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in many real-world applications. Due to SpGEMM's inherent irregularity and the vast diversity of its input matrices, developing a high-performance SpGEMM implementation on modern processors such as GPUs is challenging. The state-of-the-art SpGEMM libraries (i.e., nsparse and spECK) adopt several algorithms to tackle the challenges of global load balance, local load balance, and allocation of the result matrix. While these libraries focus on high-level algorithm design for SpGEMM, they neglect several low-level, architecture-specific optimizations, leading to inefficiencies in their implementations. In this paper, we classify these inefficiencies into several categories. Based on our observations, we propose a highly optimized SpGEMM library called OpSparse. The optimizations in OpSparse include 1) optimizing the binning method by improving the utilization of the shared memory, 2) optimizing the hashing method by reducing accesses to the hash table, 3) improving the trade-off between hash collision rate and hardware utilization in the hashing method by setting appropriate binning ranges, 4) reducing global memory overheads by minimizing the global memory usage of the metadata, and 5) improving execution parallelism by overlapping global memory allocation with kernel execution. Performance evaluations with 26 commonly used matrices on an Nvidia Tesla V100 GPU show that OpSparse achieves on average 7.35× (up to 27.8×), 1.43× (up to 1.81×), and 1.52× (up to 2.04×) speedups over three state-of-the-art SpGEMM libraries: cuSPARSE, nsparse, and spECK, respectively.
INDEX TERMS Sparse general matrix multiplication, SpGEMM, GPU, high-performance computing.
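For readers unfamiliar with the hashing method the abstract refers to, below is a minimal sketch of hash-based row accumulation, the core primitive that nsparse/spECK-style SpGEMM implements per row of C = A × B in CSR form, and that OpSparse's optimizations (shared-memory binning, fewer table probes) tune on the GPU. The code is plain Python for clarity, not the libraries' CUDA kernels.

```python
# Sketch: compute one CSR row of C = A * B by scattering a_ik * row_k(B)
# into a per-row hash table (here a Python dict standing in for the
# shared-memory hash table a GPU kernel would use).
def spgemm_row(a_cols, a_vals, b_indptr, b_indices, b_data):
    """a_cols/a_vals: nonzeros of one row of A. B is given in CSR."""
    acc = {}  # hash table: column index -> accumulated value
    for k, a_ik in zip(a_cols, a_vals):
        for p in range(b_indptr[k], b_indptr[k + 1]):
            j = b_indices[p]
            acc[j] = acc.get(j, 0.0) + a_ik * b_data[p]
    cols = sorted(acc)  # CSR expects sorted column indices
    return cols, [acc[j] for j in cols]

# Row of A with a[0,0]=2 and a[0,2]=3; B is the 3x3 identity in CSR.
b_indptr, b_indices, b_data = [0, 1, 2, 3], [0, 1, 2], [1.0, 1.0, 1.0]
print(spgemm_row([0, 2], [2.0, 3.0], b_indptr, b_indices, b_data))
# -> ([0, 2], [2.0, 3.0])
```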