Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126950

GPU triggered networking for intra-kernel communications

Cited by 16 publications (8 citation statements)
References 15 publications
“…The accelerator can employ all the features of the CPU network stack, including NIC hardware offloads, while occupying a relatively small area (see Table 1). However, the CPU is involved in every network transaction, limiting scalability, hurting performance, and wasting CPU cycles [22,60,91].…”
Section: Accelerator Networking Architectures
confidence: 99%
“…Our work builds upon previous GPU networking work to use NICs directly from GPUs, eliminating the CPU from the critical path [2,22,60,84,91]. These works implement communication tasks in GPGPU cores or CUDA stream MemOps [2].…”
Section: Related Work
confidence: 99%
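The excerpt above mentions implementing communication tasks via CUDA stream MemOps, i.e. stream-ordered writes and waits on 32-bit flags in memory. The following is a minimal Python model of that synchronization pattern, assuming only its semantics; the function names are illustrative analogues, not the CUDA driver API (`cuStreamWriteValue32` / `cuStreamWaitValue32`).

```python
import threading

# Shared 32-bit flag location, as a NIC or another stream would see it.
flag = {"value": 0}
cond = threading.Condition()

def wait_value_eq(target):
    # Analogue of cuStreamWaitValue32 with an equality condition:
    # the "stream" (here, a thread) stalls until flag == target.
    with cond:
        cond.wait_for(lambda: flag["value"] == target)

def write_value(value):
    # Analogue of cuStreamWriteValue32: store the value and wake waiters.
    with cond:
        flag["value"] = value
        cond.notify_all()

events = []

def consumer():
    wait_value_eq(42)                  # blocks until the write lands
    events.append("kernel-after-wait")

t = threading.Thread(target=consumer)
t.start()
events.append("write")                 # recorded before the flag is set
write_value(42)
t.join()
print(events)  # ['write', 'kernel-after-wait']
```

The point of the pattern is that the ordering is enforced by the memory flag rather than by CPU involvement in each transaction.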
“…The advent of heterogeneous systems, especially with the use of hardware accelerators, brings back to the forefront the modeling question of these complex systems. Moving data between accelerator memories has been a significant bottleneck in distributed computing environments [20,21]. Unlike earlier systems that rely mainly on CPU-initiated mechanisms [20], moving data residing on accelerator memories has recently involved novel mechanisms, including device-initiated [3,12,[22][23][24] and hardware transparent migration using unified memory models [25,26].…”
Section: Related Work
confidence: 99%
“…A number of works implement intra-kernel networking while avoiding CPU helper threads. GPU-TN [19] provides an intra-kernel networking scheme by using a mechanism based on Portals 4 triggered operations [35]. GPU Global Address Space (GGAS) [27] implements intra-kernel networking by adding explicit hardware in the GPU to support a clusterwide global address space.…”
Section: Related Work
confidence: 99%
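The triggered operations that GPU-TN builds on follow a simple deferral rule: a network operation is registered against a counter with a threshold and fires automatically once the counter reaches it. Below is a minimal Python sketch of that rule; the class and method names are illustrative stand-ins, not the Portals 4 C API (which exposes this via calls such as `PtlTriggeredPut` and counting events).

```python
class TriggeredQueue:
    """Toy model of a NIC queue supporting counter-triggered operations."""

    def __init__(self):
        self.counter = 0
        self.pending = []   # list of (threshold, payload) pairs
        self.fired = []     # payloads of operations that have fired

    def triggered_put(self, threshold, payload):
        # Register a put that fires once counter >= threshold.
        self.pending.append((threshold, payload))
        self._poll()

    def ct_inc(self, amount=1):
        # Increment the counting event (e.g. a GPU kernel signalling
        # that a block of data is ready), then re-check pending ops.
        self.counter += amount
        self._poll()

    def _poll(self):
        ready = [p for p in self.pending if self.counter >= p[0]]
        self.pending = [p for p in self.pending if self.counter < p[0]]
        self.fired.extend(payload for _, payload in ready)

q = TriggeredQueue()
q.triggered_put(threshold=2, payload="send block 0")
q.ct_inc()              # counter = 1: the put stays deferred
assert q.fired == []
q.ct_inc()              # counter = 2: the deferred put fires
print(q.fired)          # ['send block 0']
```

This is why the approach lets a running kernel drive communication without CPU helper threads: the CPU pre-arms the operation, and the GPU merely bumps a counter.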
“…Inter-kernel networking can also impose performance challenges when networking is frequent relative to computation, and it limits the class of algorithms that can be offloaded to a GPU. To put this into perspective, waiting for kernel tear-down/startup has been shown to take upwards of 10µs [19]. This is an order of magnitude greater than modern network latencies, which hover around 0.7µs at the time of this writing [22].…”
Section: Introduction
confidence: 99%
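A quick back-of-the-envelope check of the gap quoted in this excerpt, using the two figures it cites (~10µs kernel tear-down/startup vs. ~0.7µs network latency):

```python
# Figures quoted in the excerpt above; both in microseconds.
kernel_overhead_us = 10.0
network_latency_us = 0.7

ratio = kernel_overhead_us / network_latency_us
print(f"{ratio:.1f}x")  # roughly 14x: kernel overhead dominates the wire time
```

That ratio is the core of the argument for intra-kernel networking: paying kernel boundaries per message costs about an order of magnitude more than the network itself.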