Bandwidth optimal all-reduce algorithms for clusters of workstations
2009
DOI: 10.1016/j.jpdc.2008.09.002

Abstract: We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound of the amount of data that must be communicated in order to complete this operation and propose a ring-based algorithm that only requires tree connectivity to achieve bandwidth optimality. Unlike the widely used butterfly-like all-reduce algorithm that incurs network contention in SMP/multi-core…
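As a back-of-the-envelope illustration of the bandwidth-optimality claim (my own arithmetic, not taken from the paper): with p processes each holding n data items, a ring all-reduce makes every process send 2n(p-1)/p items in total, split between a reduce-scatter phase and an all-gather phase. This approaches 2n as p grows and matches, up to lower-order terms, the kind of per-node lower bound the abstract refers to.

```python
# Back-of-the-envelope arithmetic (illustration only, not code from the paper):
# per-process communication volume of a ring all-reduce on n items across
# p processes.
def ring_allreduce_traffic(n, p):
    reduce_scatter = (p - 1) * n / p    # p-1 steps, each sending an n/p chunk
    all_gather = (p - 1) * n / p        # another p-1 steps of n/p each
    return reduce_scatter + all_gather  # = 2 * n * (p - 1) / p

for p in (2, 4, 8, 64, 1024):
    print(p, ring_allreduce_traffic(n=1.0, p=p))  # tends toward 2.0 as p grows
```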

Cited by 300 publications (148 citation statements)
References 26 publications
“…In contrast to PS, All-Reduce replaces the use of central nodes with carefully scheduled global communication to achieve better parallelism. The state-of-the-art solutions [31,41,45] leverage Ring All-Reduce [38], the advanced all-reduce algorithm that effectively utilizes the bandwidth between computation devices. Specifically, workers are organized as a ring, and gradients are divided into chunks and passed over the ring in a parallel manner.…”
Section: Existing Synchronization Approaches (mentioning)
confidence: 99%
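The chunked ring schedule described in this citation statement can be made concrete with a small single-process simulation (my own sketch; the helper name ring_allreduce and the chunk-indexing formulas are illustrative, not code from any of the cited systems). Each of the p simulated workers holds a gradient split into p chunks; the reduce-scatter phase accumulates partial sums around the ring, and the all-gather phase circulates the fully reduced chunks.

```python
# Illustrative single-process simulation of ring all-reduce
# (reduce-scatter + all-gather); not the code of the cited systems.
import numpy as np

def ring_allreduce(chunks_per_worker):
    """chunks_per_worker[r][c] is chunk c held by worker r (NumPy arrays)."""
    p = len(chunks_per_worker)
    data = [list(map(np.array, w)) for w in chunks_per_worker]  # private copies

    # Phase 1: reduce-scatter. After p-1 steps, worker r holds the fully
    # reduced chunk (r + 1) % p.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, data[r][(r - s) % p].copy()) for r in range(p)]
        for r, c, payload in sends:
            data[(r + 1) % p][c] += payload

    # Phase 2: all-gather. Circulate the reduced chunks around the ring.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, data[r][(r + 1 - s) % p].copy()) for r in range(p)]
        for r, c, payload in sends:
            data[(r + 1) % p][c] = payload

    return data

# Example: 4 workers, a gradient of 8 elements split into 4 chunks of 2.
p = 4
grads = [np.arange(8, dtype=float) + r for r in range(p)]
chunks = [np.split(g, p) for g in grads]
out = ring_allreduce(chunks)
expected = np.split(sum(grads), p)
assert all(np.allclose(out[r][c], expected[c]) for r in range(p) for c in range(p))
```

Because every step moves only an n/p-sized chunk between ring neighbors, all links are kept busy in parallel, which is the property the quoted statement highlights.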
“…In contrast, homogeneous models enforced by S-PSGD cannot converge with a large batch size and aggressive learning rate for our ASR task setting [4]. A good allreduce implementation can finish each round of communication after effectively 2 messages are sent across the communication network, independent of the number of learners [6]. We choose the Nvidia NCCL [7] as our allreduce implementation.…”
Section: Design and Implementation (mentioning)
confidence: 99%
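The "effectively 2 messages" remark can be read through the usual latency-bandwidth cost model. A small sketch under my own assumptions (alpha = per-message startup latency, beta = per-byte transfer time; neither symbol appears in the quoted text): the bandwidth term of a ring all-reduce is 2n(p-1)/p * beta, i.e. each learner sends the gradient roughly twice, independent of the number of learners.

```python
# Illustrative latency-bandwidth model (my assumptions, not from the cited
# papers): time for a ring all-reduce of an n-byte gradient over p learners.
def ring_allreduce_time(n, p, alpha, beta):
    latency = 2 * (p - 1) * alpha           # one message per step, 2(p-1) steps
    bandwidth = 2 * (p - 1) / p * n * beta  # ~2n bytes per learner, for any p
    return latency + bandwidth

# The bandwidth term barely changes with p: each learner sends the gradient
# "effectively twice", as the quoted statement puts it.
for p in (2, 8, 128):
    print(p, ring_allreduce_time(n=100_000_000, p=p, alpha=5e-6, beta=1e-9))
```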
“…We replace the original gRPC implementation with Message Passing Interface (MPI) and NVIDIA Collective Communications Library (NCCL) [8]. NCCL provides a highly optimized version of routines, such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, and the integrated bandwidth-optimal ring all-reduce algorithm [33], to achieve high bandwidth over PCIe on NVIDIA GPU. In order to scale from one GPU to multiple nodes and multiple GPUs, we implement several APIs for communication: 1) a broadcast operation to synchronize parameters among all GPUs at the initialization stage or the recovery from the checkpoint; 2) a distributed optimizer wrapper for synchronization update of parameters; 3) some operations for data partition and barrier, etc.…”
Section: Acceleration By Distributed Training (mentioning)
confidence: 99%
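As a rough illustration of the workflow described above (broadcast of parameters at initialization or checkpoint recovery, then gradient synchronization each step), here is a minimal sketch using PyTorch's torch.distributed with the NCCL backend. The cited work builds on MPI/NCCL directly, so the framework choice, function names, and loop structure below are assumptions for illustration only.

```python
# Hedged sketch of the broadcast-then-all-reduce pattern over the NCCL
# backend via torch.distributed; illustrative, not the cited system's code.
import torch
import torch.distributed as dist

def init_and_broadcast(model, rank, world_size):
    # One process per GPU; the launcher is assumed to set MASTER_ADDR/PORT.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # (1) Broadcast initial (or checkpoint-restored) parameters from rank 0.
    for param in model.parameters():
        dist.broadcast(param.data, src=0)

def allreduce_gradients(model, world_size):
    # (2) After backward(), average gradients across all workers.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

NCCL selects its communication schedule (ring-based, and tree-based in newer releases) internally, so the broadcast and all-reduce calls above stay the same regardless of the underlying topology.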