Graph neural networks (GNNs) are a promising emerging application for link prediction, recommendation, and similar tasks. Existing hardware innovation is limited to single-machine GNN (SM-GNN); enterprises, however, typically adopt huge graphs and run large-scale distributed GNN (LSD-GNN), which must be carried out with distributed in-memory storage. LSD-GNN differs substantially from SM-GNN in system architecture demands, workflow, and operators, and hence in its characteristics. In this paper, we first quantitatively characterize LSD-GNN with an industrial-grade framework and application, and show that its challenges lie in graph sampling: distributed graph access, long latency, and underutilized communication and memory bandwidth. These challenges are missing from previous SM-GNN-targeted research. We then propose a customized hardware architecture to address them, including a fully pipelined access-engine architecture for graph access and sampling, low-latency and bandwidth-efficient customized memory-over-fabric hardware, and a RISC-V-centric control system that provides good programmability. We implement the proposed architecture, with full software support, in a 4-card FPGA heterogeneous proof-of-concept (PoC) system. Based on measurements from the FPGA PoC, we demonstrate that a single FPGA can provide the sampling capability of up to 894 vCPUs. With the goal of being profitable, programmable, and scalable, we further integrate the architecture into an FPGA cloud (FaaS) at hyperscale, along with the industrial software framework. We explicitly explore eight FaaS architectures that carry the proposed accelerator hardware. We conclude that off-the-shelf FaaS.base already provides a 2.47× performance-per-dollar improvement with our hardware. With architectural optimizations, FaaS.comm-opt with customized FPGA fabrics pushes the benefit to 7.78×, and FaaS.mem-opt with FPGA-local DRAM and high-speed links to the GPU further raises it to 12.58×.
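To make the sampling bottleneck concrete, below is a minimal sketch of multi-hop neighbor sampling over a partitioned graph, the operation this abstract identifies as the LSD-GNN challenge. All names here (`Partition`, `fetch_neighbors`, `sample_khop`) are hypothetical illustrations, not the paper's API; the point is that every hop issues fine-grained remote reads per frontier node, which is why a pipelined access engine and low-latency fabric pay off.

```python
# Sketch: fan-out neighbor sampling over a distributed (partitioned) graph.
# In LSD-GNN each fetch_neighbors() is a remote access over the fabric,
# so latency, not compute, dominates sampling time.
import random

class Partition:
    """One shard of the distributed graph, held on a remote memory node."""
    def __init__(self, adj):
        self.adj = adj  # node id -> list of neighbor ids

    def fetch_neighbors(self, node):
        return self.adj.get(node, [])

def sample_khop(seeds, partitions, owner, fanouts):
    """Sample a `fanouts`-shaped neighborhood around each seed node.

    owner(node) maps a node id to the partition holding its adjacency list.
    """
    frontier, layers = list(seeds), []
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            nbrs = partitions[owner(node)].fetch_neighbors(node)
            next_frontier.extend(random.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(next_frontier)
        frontier = next_frontier
    return layers

# Toy example: two partitions holding even/odd node ids.
parts = [Partition({0: [1, 2, 3], 2: [1, 5]}),
         Partition({1: [0, 2], 3: [0], 5: [2]})]
print(sample_khop([0], parts, owner=lambda n: n % 2, fanouts=[2, 2]))
```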
Graph neural networks (GNNs) have drawn tremendous attention in the past few years owing to their convincing performance and high interpretability on graph-based tasks such as link prediction and node classification. With ever-growing graph sizes in the real world, especially billion-scale industrial graphs, graph storage can easily consume terabytes, so GNN processing must be carried out in a distributed manner. As a result, execution can be inefficient due to expensive cross-node communication and irregular memory access. Various GNN accelerators have been proposed for efficient GNN processing; however, they mainly target small and medium-size graphs and are not applicable to large-scale distributed graphs. In this paper, we present a practical near-data-processing architecture based on a memory-pool system for large-scale distributed GNNs. We propose a customized memory-fabric interface to construct the memory pool for low-latency, high-throughput cross-node communication, providing flexible memory allocation and strong scalability. A practical near-data-processing design is proposed for efficient work offloading and improved bandwidth utilization. Moreover, we introduce a partition-and-scheduling scheme to further improve performance and achieve workload balance. Comprehensive evaluations demonstrate that the proposed architecture achieves up to 27× and 8× higher training speed than two state-of-the-art distributed GNN frameworks, Deep Graph Library and P3, respectively.
INDEX TERMS Graph neural network, large-scale graph processing, memory pool, near-data processing.
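As a rough illustration of the kind of workload balance the partition-and-scheduling scheme targets, the sketch below assigns vertices to memory nodes greedily by degree so that per-node sampling work is even. This is an assumed, generic longest-processing-time heuristic for illustration only, not the paper's actual algorithm.

```python
# Sketch: degree-aware placement of vertices onto memory-pool nodes so
# that each node carries roughly equal neighbor-access work.
import heapq

def balance_by_degree(degrees, num_nodes):
    """degrees: {vertex: degree}. Returns {vertex: node_id}.

    Greedy LPT assignment: heaviest vertices first, each placed on the
    currently least-loaded memory node.
    """
    heap = [(0, node) for node in range(num_nodes)]  # (load, node_id)
    heapq.heapify(heap)
    placement = {}
    for v, deg in sorted(degrees.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        placement[v] = node
        heapq.heappush(heap, (load + deg, node))
    return placement

# Five vertices of skewed degree onto two nodes; loads end at 105 vs 105.
print(balance_by_degree({0: 100, 1: 90, 2: 10, 3: 5, 4: 5}, num_nodes=2))
```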
Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in many real-world applications. Due to SpGEMM's inherent irregularity and the vast diversity of its input matrices, developing a high-performance SpGEMM implementation on modern processors such as GPUs is challenging. The state-of-the-art SpGEMM libraries (i.e., nsparse and spECK) adopt several algorithms to tackle the challenges of global load balance, local load balance, and allocation of the result matrix. While these libraries focus on high-level algorithm design for SpGEMM, they neglect several low-level, architecture-specific optimizations, leading to inefficiencies in their implementations. In this paper, we classify these inefficiencies into several categories. Based on our observations, we propose a highly optimized SpGEMM library called OpSparse. The optimizations in OpSparse include 1) optimizing the binning method by improving the utilization of the shared memory, 2) optimizing the hashing method by reducing accesses to the hash table, 3) improving the trade-off between hash collision rate and hardware utilization in the hashing method by setting appropriate binning ranges, 4) reducing global memory overheads by minimizing the global memory usage of the metadata, and 5) improving execution parallelism by overlapping global memory allocation with kernel execution. Performance evaluations with 26 commonly used matrices on an Nvidia Tesla V100 GPU show that OpSparse achieves on average 7.35× (up to 27.8×), 1.43× (up to 1.81×), and 1.52× (up to 2.04×) speedups over three state-of-the-art SpGEMM libraries: cuSPARSE, nsparse, and spECK, respectively.
INDEX TERMS Sparse general matrix multiplication, SpGEMM, GPU, high-performance computing.
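For readers unfamiliar with the hashing method the abstract refers to, below is a minimal sketch of hash-based row accumulation, the core primitive that nsparse/spECK-style SpGEMM implements per row of C = A × B in CSR form, and that OpSparse's optimizations (shared-memory binning, fewer table probes) tune on the GPU. The code is plain Python for clarity, not the libraries' CUDA kernels.

```python
# Sketch: compute one CSR row of C = A * B by scattering a_ik * row_k(B)
# into a per-row hash table (here a Python dict standing in for the
# shared-memory hash table a GPU kernel would use).
def spgemm_row(a_cols, a_vals, b_indptr, b_indices, b_data):
    """a_cols/a_vals: nonzeros of one row of A. B is given in CSR."""
    acc = {}  # hash table: column index -> accumulated value
    for k, a_ik in zip(a_cols, a_vals):
        for p in range(b_indptr[k], b_indptr[k + 1]):
            j = b_indices[p]
            acc[j] = acc.get(j, 0.0) + a_ik * b_data[p]
    cols = sorted(acc)  # CSR expects sorted column indices
    return cols, [acc[j] for j in cols]

# Row of A with a[0,0]=2 and a[0,2]=3; B is the 3x3 identity in CSR.
b_indptr, b_indices, b_data = [0, 1, 2, 3], [0, 1, 2], [1.0, 1.0, 1.0]
print(spgemm_row([0, 2], [2.0, 3.0], b_indptr, b_indices, b_data))
# -> ([0, 2], [2.0, 3.0])
```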