Proceedings of the 25th European MPI Users' Group Meeting 2018
DOI: 10.1145/3236367.3236381

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters

Abstract: Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense Multi-GPU nodes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK pose additi…

Cited by 26 publications (8 citation statements)
References 24 publications (31 reference statements)

“…Klenk et al [59] analyzed the exascale proxy applications on their communication patterns and proposed a matching algorithm for GPUs to comply with MPI constraints. Awan et al [60] proposed a pipelined chain design for MPI broadcast collective operations on multi-GPU nodes to facilitate various deep learning frameworks.…”
Section: Multi-node GPU Computing (mentioning; confidence: 99%)
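
The pipelined chain design credited to Awan et al [60] is only named in this excerpt, not described; the sketch below is a generic chunked chain broadcast written with plain MPI point-to-point calls, under assumed parameters (chain_bcast, the chunk size, and the byte-wise buffer layout are illustrative, not the MVAPICH2 implementation). With a CUDA-aware MPI runtime such as MVAPICH2, buf could in principle be a GPU device pointer.

#include <mpi.h>
#include <stddef.h>

/* Sketch: broadcast `bytes` bytes from `root` along a logical chain of ranks,
 * splitting the buffer into `chunk`-byte pieces so transfers overlap along the
 * chain (rank i forwards chunk c while rank i-1 is already sending chunk c+1). */
void chain_bcast(void *buf, size_t bytes, size_t chunk, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Position in a logical chain that starts at the root and wraps around. */
    int pos  = (rank - root + size) % size;
    int prev = (pos == 0)        ? MPI_PROC_NULL : (rank - 1 + size) % size;
    int next = (pos == size - 1) ? MPI_PROC_NULL : (rank + 1) % size;

    char *p = (char *)buf;
    for (size_t off = 0; off < bytes; off += chunk) {
        int n = (int)((bytes - off < chunk) ? (bytes - off) : chunk);
        if (prev != MPI_PROC_NULL)   /* receive this chunk from the predecessor */
            MPI_Recv(p + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);
        if (next != MPI_PROC_NULL)   /* forward it to the successor */
            MPI_Send(p + off, n, MPI_BYTE, next, 0, comm);
    }
}

Because each rank forwards one chunk while its predecessor is already sending the next, the message flows through the chain as a pipeline instead of waiting for the full buffer at every hop.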
“…With the future availability of MPI-GDS [28], the asynchronous send operations can be triggered directly after the squared absolute values are computed, leading to better hiding of the communication. In addition, also the optimization of collective operations is under investigation [29,30]. Therefore, future library implementations offer the potential to further improve the performance of the proposed implementation.…”
Section: Benchmark (mentioning; confidence: 99%)
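
The overlap described above can be sketched with standard non-blocking MPI (an illustration under assumptions, not the cited implementation; overlap_send and the block layout are hypothetical): the send of block b is posted as soon as its squared absolute values are computed, and progresses while block b+1 is processed.

#include <mpi.h>
#include <stddef.h>

/* Sketch: square each block in place, then hand it to a non-blocking send so
 * the transfer overlaps with the computation of the following block. */
void overlap_send(double *data, int nblocks, int blen, int dst, MPI_Comm comm)
{
    MPI_Request req = MPI_REQUEST_NULL;
    for (int b = 0; b < nblocks; ++b) {
        double *blk = data + (size_t)b * blen;
        for (int i = 0; i < blen; ++i)       /* compute squared absolute values */
            blk[i] = blk[i] * blk[i];
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* previous send must have finished */
        MPI_Isend(blk, blen, MPI_DOUBLE, dst, b, comm, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* drain the last outstanding send */
}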
“…For example, to support this operation, Hadoop introduces Distributed Cache and the size is set to 10 GB by default [27]. However, existing broadcast algorithms are usually designed for messages no larger than hundreds of MBs, and they usually use tree-based logic topology and small-chunk-based pipelining techniques which cause the contention of the bandwidth of a physical link by multiple logic links and high chunking overhead [26], [28], [29]. To fully utilize each cable's bidirectional bandwidth and the aggregate bandwidth of clusters, and avoid the chunking overhead of pipelining, we propose a Fast BroadCast algorithm (FastBC).…”
Section: Fast Broadcast Algorithm (mentioning; confidence: 99%)
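
The chunking trade-off mentioned above can be made concrete with a standard latency-bandwidth cost model (a textbook approximation, not a result from the cited papers): a chain broadcast of an m-byte message split into k chunks over p processes costs roughly

    T_{\mathrm{chain}}(m, k, p) \approx (p + k - 2)\left(\alpha + \frac{m}{k}\,\beta\right),

where \alpha is the per-message start-up cost and \beta the per-byte transfer time. More chunks (larger k) shrink the per-hop payload m/k and deepen the pipeline, but inflate the (p + k - 2)\alpha start-up term; that start-up term is the chunking overhead the quoted passage says FastBC is designed to avoid.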