Proceedings of the 23rd European MPI Users' Group Meeting 2016
DOI: 10.1145/2966884.2966912
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Cited by 42 publications (18 citation statements)
References 10 publications
“…Figure 6 shows the current positioning of Cylon in deep learning integration. To further enhance the distributed operations, we can add specific support to deep learning settings such as NCCL [19].…”
Section: Transport Layer (mentioning)
confidence: 99%
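As a rough illustration of what such NCCL support looks like at the API level, the sketch below broadcasts a GPU-resident buffer with ncclBroadcast, bootstrapping the NCCL communicator over MPI. The one-rank-per-GPU layout, buffer size, and root rank are assumptions for the example, not details from the cited work.

```c
/* Minimal NCCL broadcast sketch: one MPI rank per GPU.
 * Assumes NCCL >= 2.2 (ncclBroadcast) and an MPI launcher. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    cudaSetDevice(rank);              /* assumption: rank i drives GPU i */

    /* Rank 0 creates the NCCL id; MPI distributes it to all ranks. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    const size_t count = 64 * 1024 * 1024;  /* 256 MB of floats (example size) */
    float *buf;
    cudaMalloc(&buf, count * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    /* Broadcast the device buffer from rank 0 to all GPUs. */
    ncclBroadcast(buf, buf, count, ncclFloat, 0, comm, stream);
    cudaStreamSynchronize(stream);

    cudaFree(buf);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```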
“…For the large and very large message ranges, we see that NCCL provides scalable performance. At the same time, our proposed pipelined chain designs in MVAPICH2-GDR allow us to achieve similar or better performance, essentially alleviating the need to resort to the NCCL-augmented broadcast designs proposed in [4].…”
Section: B. Intranode Performance Comparison (Micro-benchmark) (mentioning)
confidence: 91%
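The idea behind a pipelined chain broadcast can be sketched with plain CUDA-aware MPI point-to-point calls: the message is split into chunks, and each rank forwards chunk i to its successor while its predecessor is already sending chunk i+1, so transfers overlap along the chain. The helper name chain_bcast, the chunk size, and the rank ordering below are illustrative assumptions, not the tuned MVAPICH2-GDR design.

```c
/* Pipelined chain broadcast sketch over CUDA-aware MPI.
 * Ranks form a chain 0 -> 1 -> ... -> n-1; with a CUDA-aware MPI,
 * buf may be a device pointer. */
#include <mpi.h>

void chain_bcast(float *buf, size_t count, MPI_Comm comm) {
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    const size_t chunk = 1 << 20;          /* 1M floats per chunk (assumption) */
    int prev = rank - 1, next = rank + 1;

    for (size_t off = 0; off < count; off += chunk) {
        size_t n = (count - off < chunk) ? (count - off) : chunk;
        /* Receive this chunk from the predecessor (the root has none). */
        if (prev >= 0)
            MPI_Recv(buf + off, (int)n, MPI_FLOAT, prev, 0, comm,
                     MPI_STATUS_IGNORE);
        /* Forward it down the chain; upstream ranks are already moving the
         * next chunk, which is where the pipelining comes from. */
        if (next < nranks)
            MPI_Send(buf + off, (int)n, MPI_FLOAT, next, 0, comm);
    }
}
```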
“…CUDA-Aware MPI runtimes like MVAPICH2-GDR are flexible enough to integrate third-party libraries like NCCL. In this context, we designed and evaluated NCCL-based MPI_Bcast designs in our earlier work [4]. The hierarchical nature of collective communication in MVAPICH2 allowed us to exploit NCCL for intranode communication along with efficient and tuned designs for internode communication.…”
Section: Limitations of NCCL-Integrated MPI Designs (mentioning)
confidence: 99%
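The hierarchical scheme described above can be sketched as a two-level broadcast: one leader per node takes part in an internode MPI_Bcast, then each leader fans the data out to its local GPUs with NCCL. The helper name hier_bcast, the communicator splitting, and the leader choice below are assumptions for illustration, not the library's internal design.

```c
/* Two-level broadcast sketch: MPI across nodes, NCCL inside each node.
 * Assumes one MPI rank per GPU, a pre-built per-node NCCL communicator
 * whose rank order matches node_rank, and global rank 0 as the root. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void hier_bcast(float *dev_buf, size_t count, MPI_Comm world,
                ncclComm_t node_nccl, cudaStream_t stream) {
    /* Split ranks by node; local rank 0 acts as the node leader. */
    MPI_Comm node_comm, leader_comm;
    int node_rank;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_split(world, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* Stage 1: internode broadcast among node leaders (CUDA-aware MPI,
     * so dev_buf can be a device pointer). */
    if (node_rank == 0)
        MPI_Bcast(dev_buf, (int)count, MPI_FLOAT, 0, leader_comm);

    /* Stage 2: intranode fan-out from the leader (NCCL rank 0, by the
     * rank-order assumption above) to the local GPUs. */
    ncclBroadcast(dev_buf, dev_buf, count, ncclFloat, 0, node_nccl, stream);
    cudaStreamSynchronize(stream);

    MPI_Comm_free(&node_comm);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
}
```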
“…For instance, Glaser et al. [6] implement strong-scaling versions of general-purpose molecular dynamics simulations on GPUs, and Lončar et al. use it as well in the aforementioned solver from [10]. Deep learning and data analytics are other fields taking advantage of CUDA-aware MPI implementations, for example by exploiting them to support efficient large-message broadcast operations [1]. That work also exploits NCCL in order to optimize intranode communication among directly connected GPUs.…”
Section: Related Work (mentioning)
confidence: 99%
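What "CUDA-aware" buys in practice is that device pointers can be handed straight to MPI calls, with no staging through host memory. The snippet below is a minimal sketch of that idiom; the buffer size and root rank are arbitrary example values.

```c
/* CUDA-aware MPI sketch: broadcast a device buffer directly.
 * With a CUDA-aware build (e.g. MVAPICH2-GDR, Open MPI with UCX),
 * MPI_Bcast accepts the GPU pointer; no cudaMemcpy staging is needed. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(0);                  /* assumption: one GPU per process */

    const int count = 1 << 24;         /* 16M floats = 64 MB (example size) */
    float *dbuf;
    cudaMalloc(&dbuf, count * sizeof(float));

    /* Device pointer passed directly; a non-CUDA-aware MPI would fail here. */
    MPI_Bcast(dbuf, count, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```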