IEEE INFOCOM 2020 - IEEE Conference on Computer Communications
DOI: 10.1109/infocom41043.2020.9155446

Preemptive All-reduce Scheduling for Expediting Distributed DNN Training

Citations: cited by 48 publications (21 citation statements). References: 15 publications.
“…With dynamic graphs, the next iteration might touch a different set of parameters, which would invalidate the schedule derived from the previous iteration. PACE [12] computes the optimal communication schedule and implements preemption by segmenting primitive AllReduce operations into smaller pieces. Although segmenting can indeed mimic preemption, it can, on the other hand, hurt the total communication time, as we have seen in Fig.…”
Section: Related Work
confidence: 99%
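
The segmenting idea quoted above can be illustrated with a short sketch. The snippet below is an illustrative approximation, not PACE's actual implementation: each ready gradient is split into fixed-size chunks, and the priority queue is re-consulted at every chunk boundary, so a newly ready, higher-priority tensor can jump ahead of the remaining chunks of a lower-priority one. The names CHUNK_ELEMS, enqueue, and drain_one, and the priority convention, are assumptions made for this example; it also presumes torch.distributed's default process group has been initialized.

    import heapq
    import itertools
    import torch
    import torch.distributed as dist

    CHUNK_ELEMS = 1 << 20        # assumed segment size: 1M elements per chunk
    _order = itertools.count()   # tie-breaker so equal-priority chunks stay FIFO
    pending = []                 # min-heap of (priority, arrival order, chunk view)

    def enqueue(grad: torch.Tensor, priority: int) -> None:
        # Split a ready (contiguous) gradient into chunk views and queue them.
        # Lower priority = closer to the model input, i.e. needed earlier in
        # the next iteration's forward pass.
        for chunk in grad.view(-1).split(CHUNK_ELEMS):
            heapq.heappush(pending, (priority, next(_order), chunk))

    def drain_one() -> None:
        # Launch only the single highest-priority chunk; "preemption" is
        # emulated because the queue is re-checked at every chunk boundary.
        if pending:
            _, _, chunk = heapq.heappop(pending)
            dist.all_reduce(chunk, op=dist.ReduceOp.SUM)  # in-place sum across workers

The chunk size controls the trade-off the excerpt points out: smaller chunks give finer-grained preemption but add per-call launch overhead, which is why segmenting can lengthen the total communication time.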
“…Compared to PyTorch DDP, ZeRO can scale to much larger models as each process only needs to maintain a small partition of the model. … PipeDream [29] employs a different approach in which the model stack is decomposed into multiple stages: data parallelism is applied within a stage, while pipeline and model parallelism govern the workload across stages. One subtle detail is that, to attain high training speed, PipeDream slightly sacrifices accuracy by using the latest gradients from multiple concurrent passes.…”
Section: Related Work
confidence: 99%
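
As a rough illustration of the partitioning idea attributed to ZeRO in this excerpt (a toy sketch under assumed names, not ZeRO's real code), each rank can own the optimizer state for only a 1/world_size slice of the flattened parameters, so per-process memory shrinks as more workers join:

    import torch

    def shard_for_rank(flat_params: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        # Return the contiguous slice of flat_params whose optimizer state
        # this rank maintains; the other ranks hold the remaining slices.
        shard_len = (flat_params.numel() + world_size - 1) // world_size
        start = rank * shard_len
        return flat_params[start:start + shard_len]

    # Example: 10M parameters split across 8 workers -> ~1.25M elements of state per rank.
    flat = torch.zeros(10_000_000)
    local = shard_for_rank(flat, rank=3, world_size=8)
    print(local.numel())  # 1250000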
“…There are also schedulers for MPI-based systems, such as Slurm [16] and Nomad [17]. Machine learning and deep learning clusters are among the clusters for which researchers are developing efficient resource schedulers [25-28]. For example, Peng et al. [26] propose a scheduler that dynamically resizes the allocated resources by applying supervised and reinforcement learning on deep learning clusters.…”
Section: Related Work
confidence: 99%
“…To improve system scalability, pipelining between computing tasks and communication tasks is one of the main methods to hide some of the communication overhead [2], [23]-[25], [29]. Due to the layer-wise structure of deep models, the gradient aggregation of the current layer has no dependency on the previous layer's gradient calculation, as shown in Fig.…”
Section: Background And Related Work, A. Distributed SGD With Data Para…
confidence: 99%
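
The layer-wise independence noted in this excerpt is what makes computation/communication overlap possible. The following is a minimal sketch of a common hook-based overlap pattern, written under assumed names (attach_overlap_hooks, finish_communication); it is an illustration, not the cited systems' code, and it assumes torch.distributed has been initialized and PyTorch 2.1 or later is available. Each parameter's all-reduce is launched asynchronously as soon as its gradient is ready, so communication for later layers overlaps with gradient computation for earlier layers.

    import torch
    import torch.distributed as dist

    _handles = []  # async all-reduce work objects still in flight

    def _overlap_hook(param: torch.Tensor) -> None:
        # Fires once this parameter's gradient has been accumulated into .grad,
        # so its all-reduce runs while back-propagation through earlier layers
        # is still computing their gradients. (A real system would also divide
        # by the world size to average.)
        _handles.append(dist.all_reduce(param.grad, async_op=True))

    def attach_overlap_hooks(model: torch.nn.Module) -> None:
        # register_post_accumulate_grad_hook requires PyTorch >= 2.1.
        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(_overlap_hook)

    def finish_communication() -> None:
        # Call between loss.backward() and optimizer.step(): wait only for
        # reductions that have not finished yet.
        for work in _handles:
            work.wait()
        _handles.clear()

Because backward produces the last layers' gradients first, their reductions are already in flight while earlier layers are still being differentiated, which is exactly the independence the excerpt describes.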