Network-accelerated non-contiguous memory transfers

Di Girolamo, Salvatore; Taranov, Konstantin; Kurth, Andreas; Schaffner, Michael; Schneider, Timo; Beránek, Jakub; Besta, Maciej; Benini, Luca; Roweth, Duncan; Hoefler, Torsten

doi:10.1145/3295500.3356189

Cited by 17 publications

(9 citation statements)

References 45 publications

“…However, the majority of them focus on training, leaving much room for developing efficient distributed-memory frameworks and techniques for GNN inference. We also note high potential in incorporating high-performance interconnect related mechanisms such as Remote Direct Memory Access (RDMA) [87], SmartNICs [28], [74], [106], or novel network topologies and routing [26], [34] into the GNN domain.…”

Section: Multi-machine Parallelism (mentioning)

confidence: 99%

“…Incorporating high-performance distributed-memory capabilities: CAGNET [199] illustrated how to scalably execute GNN training across many compute nodes. It would be interesting to push this direction and use high-performance distributed-memory developments and interconnects, and the associated mechanisms for more performance of distributed-memory GNN computations, using, for example, RDMA and RMA programming [87], [179], SmartNICs [28], [74], serverless computing [67], or high-performance networking architectures [20], [24], [26], [34]. Such techniques have been successfully used to accelerate the related graph processing field [191].…”

Section: Parallelization of GNN Models Beyond Simple C-GNNs (mentioning)

confidence: 99%

Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis

Besta, Hoefler. 2022. Preprint.

Self Cite

Graph neural networks (GNNs) are among the most powerful tools in deep learning. They routinely solve complex problems on unstructured networks, such as node classification, graph classification, or link prediction, with high accuracy. However, both inference and training of GNNs are complex, and they uniquely combine the features of irregular graph processing with dense and regular computations. This complexity makes it very challenging to execute GNNs efficiently on modern massively parallel architectures. To alleviate this, we first design a taxonomy of parallelism in GNNs, considering data and model parallelism, and different forms of pipelining. Then, we use this taxonomy to investigate the amount of parallelism in numerous GNN models, GNN-driven machine learning tasks, software frameworks, or hardware accelerators. We use the work-depth model, and we also assess communication volume and synchronization. We specifically focus on the sparsity/density of the associated tensors, in order to understand how to effectively apply techniques such as vectorization. We also formally analyze GNN pipelining, and we generalize the established Message-Passing class of GNN models to cover arbitrary pipeline depths, facilitating future optimizations. Finally, we investigate different forms of asynchronicity, navigating the path for future asynchronous parallel GNN pipelines. The outcomes of our analysis are synthesized in a set of insights that help to maximize GNN performance, and a comprehensive list of challenges and opportunities for further research into efficient GNN computations. Our work will help to advance the design of future GNNs.

“…It defines a flexible and programmable network instruction set architecture (NISA) that not only lowers the barrier of entry but also supports a large set of use-cases [28]. For example, Di Girolamo et al. demonstrate up to 10x speedups for serialization and deserialization (marshalling) of non-consecutive data [20].…”

Section: Motivation (mentioning)

confidence: 99%

A RISC-V in-network accelerator for flexible high-performance low-power packet processing

Girolamo, Kurth, Calotoiu, et al. 2021

2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)

Self Cite

The capacity of offloading data and control tasks to the network is becoming increasingly important, especially if we consider the faster growth of network speed when compared to CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and potentially improving overall application performance. However, sustaining bandwidths provided by next-generation networks, e.g., 400 Gbit/s, can become a challenge. sPIN is a programming model for in-NIC compute, where users specify handler functions that are executed on the NIC, for each incoming packet belonging to a given message or flow. It enables a CUDA-like acceleration, where the NIC is equipped with lightweight processing elements that process network packets in parallel. We investigate the architectural specialties that a sPIN NIC should provide to enable high-performance, low-power, and flexible packet processing. We introduce PsPIN, a first open-source sPIN implementation, based on a multi-cluster RISC-V architecture and designed according to the identified architectural specialties. We investigate the performance of PsPIN with cycle-accurate simulations, showing that it can process packets at 400 Gbit/s for several use cases, introducing minimal latencies (26 ns for 64 B packets) and occupying a total area of 18.5 mm² (22 nm FDSOI).

“…The FPGA community has recently gained interest in processing graphs [18-20, 23, 25, 27-31, 63, 125] and other forms of general irregular computations [21,22,24,53,61,82,119,120,129]. First, some established CPU-related schemes were ported to the FPGA setting, for example vertex-centric [57,58], GAS [145], edge-centric [149], BSP [78], and MapReduce [141].…”

Section: Graph Processing on FPGAs (mentioning)

confidence: 99%

Substream-Centric Maximum Matchings on FPGA

Besta, Fischer, Ben-Nun, et al. 2019

Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Self Cite

Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and others. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forego popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams processed independently to enable more parallelism while lowering communication costs. We base our work on the theory of streaming graph algorithms and analyze 14 models and 28 algorithms. We use this analysis to provide theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4× speedup over tuned parallel CPU variants), low memory, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.
