Proceedings of the 25th European MPI Users' Group Meeting 2018
DOI: 10.1145/3236367.3236381

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters

Abstract: Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense Multi-GPU nodes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK pose additi…

Cited by 26 publications (8 citation statements)
References 24 publications (31 reference statements)

“…Klenk et al [59] analyzed the exascale proxy applications on their communication patterns and proposed a matching algorithm for GPUs to comply with MPI constraints. Awan et al [60] proposed a pipelined chain design for MPI broadcast collective operations on multi-GPU nodes to facilitate various deep learning frameworks.…”
Section: Multi-node GPU Computing (mentioning; confidence: 99%)
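
The pipelined chain design credited to Awan et al [60] is only named in this excerpt, not described; the sketch below is a generic chunked chain broadcast written with plain MPI point-to-point calls, under assumed parameters (chain_bcast, the chunk size, and the byte-wise buffer layout are illustrative, not the MVAPICH2 implementation). With a CUDA-aware MPI runtime such as MVAPICH2, buf could in principle be a GPU device pointer.

#include <mpi.h>
#include <stddef.h>

/* Sketch: broadcast `bytes` bytes from `root` along a logical chain of ranks,
 * splitting the buffer into `chunk`-byte pieces so transfers overlap along the
 * chain (rank i forwards chunk c while rank i-1 is already sending chunk c+1). */
void chain_bcast(void *buf, size_t bytes, size_t chunk, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Position in a logical chain that starts at the root and wraps around. */
    int pos  = (rank - root + size) % size;
    int prev = (pos == 0)        ? MPI_PROC_NULL : (rank - 1 + size) % size;
    int next = (pos == size - 1) ? MPI_PROC_NULL : (rank + 1) % size;

    char *p = (char *)buf;
    for (size_t off = 0; off < bytes; off += chunk) {
        int n = (int)((bytes - off < chunk) ? (bytes - off) : chunk);
        if (prev != MPI_PROC_NULL)   /* receive this chunk from the predecessor */
            MPI_Recv(p + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);
        if (next != MPI_PROC_NULL)   /* forward it to the successor */
            MPI_Send(p + off, n, MPI_BYTE, next, 0, comm);
    }
}

Because each rank forwards one chunk while its predecessor is already sending the next, the message flows through the chain as a pipeline instead of waiting for the full buffer at every hop.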
“…With the future availability of MPI-GDS [28], the asynchronous send operations can be triggered directly after the squared absolute values are computed, leading to better hiding of the communication. In addition, also the optimization of collective operations is under investigation [29,30]. Therefore, future library implementations offer the potential to further improve the performance of the proposed implementation.…”
Section: Benchmark (mentioning; confidence: 99%)
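
The overlap described above can be sketched with standard non-blocking MPI (an illustration under assumptions, not the cited implementation; overlap_send and the block layout are hypothetical): the send of block b is posted as soon as its squared absolute values are computed, and progresses while block b+1 is processed.

#include <mpi.h>
#include <stddef.h>

/* Sketch: square each block in place, then hand it to a non-blocking send so
 * the transfer overlaps with the computation of the following block. */
void overlap_send(double *data, int nblocks, int blen, int dst, MPI_Comm comm)
{
    MPI_Request req = MPI_REQUEST_NULL;
    for (int b = 0; b < nblocks; ++b) {
        double *blk = data + (size_t)b * blen;
        for (int i = 0; i < blen; ++i)       /* compute squared absolute values */
            blk[i] = blk[i] * blk[i];
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* previous send must have finished */
        MPI_Isend(blk, blen, MPI_DOUBLE, dst, b, comm, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* drain the last outstanding send */
}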
“…For example, to support this operation, Hadoop introduces Distributed Cache and the size is set to 10 GB by default [27]. However, existing broadcast algorithms are usually designed for messages no larger than hundreds of MBs, and they usually use tree-based logic topology and small-chunk-based pipelining techniques which cause the contention of the bandwidth of a physical link by multiple logic links and high chunking overhead [26], [28], [29]. To fully utilize each cable's bidirectional bandwidth and the aggregate bandwidth of clusters, and avoid the chunking overhead of pipelining, we propose a Fast BroadCast algorithm (FastBC).…”
Section: Fast Broadcast Algorithm (mentioning; confidence: 99%)
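
The chunking trade-off mentioned above can be made concrete with a standard latency-bandwidth cost model (a textbook approximation, not a result from the cited papers): a chain broadcast of an m-byte message split into k chunks over p processes costs roughly

    T_{\mathrm{chain}}(m, k, p) \approx (p + k - 2)\left(\alpha + \frac{m}{k}\,\beta\right),

where \alpha is the per-message start-up cost and \beta the per-byte transfer time. More chunks (larger k) shrink the per-hop payload m/k and deepen the pipeline, but inflate the (p + k - 2)\alpha start-up term; that start-up term is the chunking overhead the quoted passage says FastBC is designed to avoid.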