2019
DOI: 10.1002/cpe.5574
Efficient MPI‐AllReduce for large‐scale deep learning on GPU‐clusters

Abstract: Training models on large-scale GPU-accelerated clusters is becoming commonplace due to the increase in complexity and size of Deep Learning models. One of the main challenges for distributed training is the collective communication overhead for large message sizes: up to hundreds of MB. In this paper, we propose two hierarchical distributed memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters (named lr_lr and lr_rab), in which GPUs inside a computing node perform an intra-node c…
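The abstract describes a two-level (intra-node, then inter-node) allreduce structure. The sketch below is not the paper's lr_lr or lr_rab algorithm; it is a minimal C/MPI illustration of the generic hierarchical pattern such algorithms build on: reduce onto a per-node leader, allreduce among the leaders, then broadcast back within each node. The function name `hierarchical_allreduce` and the choice of a float sum are assumptions for illustration only.

```c
/* Minimal sketch of a two-level (hierarchical) allreduce with one leader
 * rank per node. NOT the paper's lr_lr/lr_rab algorithms, only the generic
 * intra-node + inter-node pattern they build on. */
#include <mpi.h>
#include <stdlib.h>

void hierarchical_allreduce(float *buf, int count, MPI_Comm comm)
{
    int world_rank;
    MPI_Comm_rank(comm, &world_rank);

    /* Group ranks that share a node into one communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Leaders (node_rank == 0) form their own inter-node communicator. */
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Step 1: intra-node reduction onto the node leader. */
    float *tmp = NULL;
    if (node_rank == 0) tmp = malloc(sizeof(float) * count);
    MPI_Reduce(buf, tmp, count, MPI_FLOAT, MPI_SUM, 0, node_comm);

    /* Step 2: inter-node allreduce among the leaders only. */
    if (node_rank == 0) {
        MPI_Allreduce(MPI_IN_PLACE, tmp, count, MPI_FLOAT, MPI_SUM,
                      leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 3: broadcast the result from the leader to all ranks on the node. */
    if (node_rank == 0) {
        MPI_Bcast(tmp, count, MPI_FLOAT, 0, node_comm);
        for (int i = 0; i < count; i++) buf[i] = tmp[i];
        free(tmp);
    } else {
        MPI_Bcast(buf, count, MPI_FLOAT, 0, node_comm);
    }

    MPI_Comm_free(&node_comm);
}
```

The paper's multi-leader variants presumably generalize this single-leader pattern by involving several GPUs per node in the inter-node step; the single-leader version above is only the simplest instance of the hierarchy.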

Cited by 6 publications (1 citation statement)
References 40 publications (106 reference statements)
“…MPI provides a powerful set of communication and synchronization mechanisms, enabling efficient communication and collaboration in parallel programs. Among these mechanisms, the Allreduce operation in MPI holds significant importance [2] . It is used for data reduction among multiple processes and is commonly employed for calculations such as summation and finding the maximum value.…”
Section: Introduction
confidence: 99%
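For context, the Allreduce operation referenced in the citation statement combines values from all ranks and leaves the result on every rank. Below is a minimal, self-contained C/MPI example (with hypothetical per-rank values) showing the summation and maximum reductions mentioned above.

```c
/* Minimal MPI_Allreduce example: sum and max of one double per rank,
 * result available on every rank. Values are hypothetical. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)(rank + 1);   /* each rank contributes rank + 1 */
    double sum = 0.0, max = 0.0;

    /* Sum of all contributions. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* Maximum across ranks. */
    MPI_Allreduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    printf("rank %d: sum = %.1f, max = %.1f\n", rank, sum, max);

    MPI_Finalize();
    return 0;
}
```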