Proceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies 2020
DOI: 10.1145/3386367.3432728
Optimizing distributed training deployment in heterogeneous GPU clusters

Cited by 22 publications (21 citation statements)
References 21 publications
“…Heterogeneous training further improves scheduling flexibility and can potentially bring more performance gains. However, it requires delicate systems and algorithm support to work well, since the workers have to adopt different hyperparameter settings and inherently make progress at different paces [8,33,38,57]. Given that heterogeneous training remains an active research topic, our production training system only provides experimental support for it at the moment.…”
Section: GPU Utilization
confidence: 99%
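
The per-worker hyperparameter issue raised in this statement can be made concrete with a small sketch. The following is a minimal illustration of one common heuristic for heterogeneous data-parallel training, not the method of the cited paper: assign each worker a local batch size proportional to its measured throughput, so fast and slow GPUs finish an iteration at roughly the same time. The function name and the example throughput figures are illustrative assumptions.

def assign_local_batch_sizes(measured_throughput, global_batch_size):
    """Split a global batch across heterogeneous workers.

    measured_throughput: dict mapping worker id -> measured samples/second.
    global_batch_size:   total number of samples processed per iteration.
    """
    total = sum(measured_throughput.values())
    return {
        worker: max(1, round(global_batch_size * tput / total))
        for worker, tput in measured_throughput.items()
    }

# Example (hypothetical throughputs): the faster GPU receives proportionally
# more samples, so both workers finish a step at roughly the same time.
print(assign_local_batch_sizes({"v100": 900.0, "p100": 450.0}, 256))
# -> {'v100': 171, 'p100': 85}
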
“…First, we do not consider the optimization solutions for individual training or inference jobs. Training job optimization mainly covers distributed training acceleration [17,74,175] and job placement optimization [84,94,161]. Inference job optimization techniques include workload characterization [15], pipeline execution [75], etc.…”
Section: Relevant Studies Not Included In This Survey
confidence: 99%
“…Placement studies focus on worker placement to minimize interference [40] instead of proximity to data, and on DNN operator placement to achieve model parallelism [41]. Computation scheduling deals with fine-grained operator execution ordering in the case of model- or pipeline-parallel DNN training [42][43][44]. Compared to distributed DNN training, GNNs are largely trained with data parallelism, incurring large graph data communication that blocks the computation and occupies a majority of the training time (up to 80% [17]).…”
Section: Distributed Training Acceleration
confidence: 99%
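
As a rough illustration of why that blocking communication matters, the sketch below (an assumed example, not code from any cited system) overlaps fetching the next mini-batch's remote graph features with computation on the current batch, so the fetch latency is partially hidden. Here fetch_features and train_step are hypothetical placeholders supplied by the caller.

from concurrent.futures import ThreadPoolExecutor

def train_with_prefetch(batches, fetch_features, train_step):
    """Overlap remote graph-feature fetching with per-batch computation."""
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start fetching the first batch's remote features immediately.
        pending = pool.submit(fetch_features, batches[0])
        for i, batch in enumerate(batches):
            features = pending.result()  # wait for this batch's features
            if i + 1 < len(batches):
                # Kick off the next fetch so it overlaps with train_step below.
                pending = pool.submit(fetch_features, batches[i + 1])
            train_step(batch, features)
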