2017 IEEE 24th International Conference on High Performance Computing (HiPC)
DOI: 10.1109/hipc.2017.00037

GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM

Cited by 22 publications (16 citation statements)
References 8 publications
“…2) NVSHMEM Perftest: As discussed in Section II, the NVSHMEM [3,12] communication library provides lightweight communication operations for accessing GPU memory. NVSHMEM is written using the IBV interface.…”
Section: A. Benchmarks
confidence: 99%
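To make the device-initiated model concrete, here is a minimal NVSHMEM sketch (our illustration, not code from the cited perftest) in which each PE puts its rank into a symmetric buffer on its neighbor directly from a CUDA kernel; it assumes a job launched with two or more PEs (e.g., via nvshmrun):

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its rank into a symmetric buffer on the next PE.
// nvshmem_int_p is a device-initiated single-element put; the NVSHMEM
// runtime carries it over NVLink or InfiniBand as appropriate.
__global__ void ring_put(int *sym_buf) {
    int me   = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    nvshmem_int_p(sym_buf, me, (me + 1) % npes);
}

int main() {
    nvshmem_init();
    // Symmetric allocation: the same buffer exists on every PE.
    int *sym_buf = (int *)nvshmem_malloc(sizeof(int));

    ring_put<<<1, 1>>>(sym_buf);
    // Make the put globally visible before anyone reads the buffer.
    nvshmemx_barrier_all_on_stream(0);
    cudaStreamSynchronize(0);

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}
```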
“…Moving data between accelerator memories has been a significant bottleneck in distributed computing environments [20,21]. Unlike earlier systems that rely mainly on CPU-initiated mechanisms [20], moving data residing on accelerator memories has recently involved novel mechanisms, including device-initiated [3,12,22-24] and hardware-transparent migration using unified memory models [25,26].…”
Section: Related Work
confidence: 99%
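For contrast with device-initiated communication, the hardware-transparent migration cited above can be sketched with CUDA unified memory; this is a generic illustration of the mechanism, not code from references [25,26]:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel increments each element; the pages holding the data may reside on
// the host or the device, and the unified memory system migrates them on demand.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;
    // cudaMallocManaged returns a pointer valid on both host and device;
    // migration between memories is handled transparently by the driver/hardware.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;      // first touched on the host

    increment<<<(n + 255) / 256, 256>>>(data, n); // pages migrate to the GPU
    cudaDeviceSynchronize();                      // required before host access

    printf("data[0] = %d\n", data[0]);            // pages migrate back on fault
    cudaFree(data);
    return 0;
}
```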
“…After fulfilling data transfer between CPUs, the function cudaMemcpyDeviceToHost is called to transfer the data from CPU to the target GPU. The latest device introduced by NVIDIA Corporation is Tesla V100, which provides the NVLink bus technique [32] to achieve communication between GPUs directly.…”
Section: CUDA and GPU Parallel Algorithm for CFD
confidence: 99%
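As a rough sketch of such direct GPU-to-GPU movement, CUDA's peer-to-peer API copies between device memories without staging through the host; on NVLink-connected GPUs such as the V100 this traverses the NVLink fabric. The device IDs and buffer size below are illustrative assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Check whether device 0 can address device 1's memory directly
    // (true when the GPUs share NVLink or a common PCIe root complex).
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("no peer access\n"); return 1; }

    const size_t bytes = 1 << 20;
    float *src, *dst;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // map device 1 into device 0's address space
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // Direct device-to-device copy; with peer access enabled this moves data
    // over NVLink/PCIe without a bounce through host memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```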
“…GPU Global Address Space (GGAS) [27] implements intra-kernel networking by adding explicit hardware in the GPU to support a cluster-wide global address space. Oden et al [29], GPUrdma [10], and Potluri et al [32] all explore techniques to implement InfiniBand entirely on the GPU. Unfortunately, these works either have challenges with performance [29] or data visibility [10,32] related to the GPU's relaxed memory consistency model.…”
Section: Related Work
confidence: 99%
“…Oden et al [29], GPUrdma [10], and Potluri et al [32] all explore techniques to implement InfiniBand entirely on the GPU. Unfortunately, these works either have challenges with performance [29] or data visibility [10,32] related to the GPU's relaxed memory consistency model. Klenk et al [17,18] explore a number of techniques and communication models to support communication directly from the GPU and show good performance in a number of cases.…”
Section: Related Work
confidence: 99%
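The data-visibility concern raised in these excerpts stems from the GPU's relaxed memory model: a device thread's writes are not guaranteed to become visible to the CPU or NIC in program order unless a system-scope fence intervenes. A minimal, hypothetical producer sketch of the ordering idiom (zero-copy host memory and the flag protocol are our assumptions, not code from the cited works):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer: writes a payload, then publishes a flag. Without the
// system-scope fence, the consumer (CPU or NIC) could observe the flag before
// the payload under the GPU's relaxed memory consistency model.
__global__ void produce(volatile int *payload, volatile int *flag) {
    *payload = 42;           // write the data
    __threadfence_system();  // order the payload write before the flag write
                             // at system scope (visible to CPU/NIC)
    *flag = 1;               // publish
}

int main() {
    int *payload, *flag;
    // Zero-copy host memory, visible to both CPU and GPU (UVA assumed).
    cudaHostAlloc(&payload, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&flag, sizeof(int), cudaHostAllocMapped);
    *payload = 0; *flag = 0;

    produce<<<1, 1>>>(payload, flag);

    while (*(volatile int *)flag == 0) { }  // CPU spins on the flag
    printf("payload = %d\n", *payload);     // guaranteed 42 by the fence

    cudaDeviceSynchronize();
    cudaFreeHost(payload); cudaFreeHost(flag);
    return 0;
}
```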