Proceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies 2020
DOI: 10.1145/3386367.3432728
Optimizing distributed training deployment in heterogeneous GPU clusters

Cited by 22 publications (21 citation statements)
References 21 publications
“…Heterogeneous training further improves scheduling flexibility and can potentially bring more performance gains. However, it requires delicate systems and algorithm support to work well, since the workers have to adopt different hyperparameter settings and inherently make progress at different paces [8,33,38,57]. Given that heterogeneous training remains an active research topic, our production training system only provides experimental support for it at the moment.…”
Section: GPU Utilization
confidence: 99%
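
The per-worker hyperparameter issue raised in this statement can be made concrete with a small sketch. The following is a minimal illustration of one common heuristic for heterogeneous data-parallel training, not the method of the cited paper: assign each worker a local batch size proportional to its measured throughput, so fast and slow GPUs finish an iteration at roughly the same time. The function name and the example throughput figures are illustrative assumptions.

def assign_local_batch_sizes(measured_throughput, global_batch_size):
    """Split a global batch across heterogeneous workers.

    measured_throughput: dict mapping worker id -> measured samples/second.
    global_batch_size:   total number of samples processed per iteration.
    """
    total = sum(measured_throughput.values())
    return {
        worker: max(1, round(global_batch_size * tput / total))
        for worker, tput in measured_throughput.items()
    }

# Example (hypothetical throughputs): the faster GPU receives proportionally
# more samples, so both workers finish a step at roughly the same time.
print(assign_local_batch_sizes({"v100": 900.0, "p100": 450.0}, 256))
# -> {'v100': 171, 'p100': 85}
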
“…First, we do not consider the optimization solutions for individual training or inference jobs. Training job optimization mainly covers distributed training acceleration [17,74,175] and job placement optimization [84,94,161]. Inference job optimization techniques include workload characterization [15], pipeline execution [75], etc.…”
Section: Relevant Studies Not Included In This Survey
confidence: 99%
“…Placement studies focus on worker placement to minimize interference [40] instead of proximity to data, and on DNN operator placement to achieve model parallelism [41]. Computation scheduling deals with fine-grained operator execution ordering in the case of model- or pipeline-parallel DNN training [42][43][44]. Compared to distributed DNN training, GNNs are largely trained with data parallelism, incurring large graph data communication that blocks the computation and occupies a majority of the training time (up to 80% [17]).…”
Section: Distributed Training Acceleration
confidence: 99%
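
As a rough illustration of why that blocking communication matters, the sketch below (an assumed example, not code from any cited system) overlaps fetching the next mini-batch's remote graph features with computation on the current batch, so the fetch latency is partially hidden. Here fetch_features and train_step are hypothetical placeholders supplied by the caller.

from concurrent.futures import ThreadPoolExecutor

def train_with_prefetch(batches, fetch_features, train_step):
    """Overlap remote graph-feature fetching with per-batch computation."""
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start fetching the first batch's remote features immediately.
        pending = pool.submit(fetch_features, batches[0])
        for i, batch in enumerate(batches):
            features = pending.result()  # wait for this batch's features
            if i + 1 < len(batches):
                # Kick off the next fetch so it overlaps with train_step below.
                pending = pool.submit(fetch_features, batches[i + 1])
            train_step(batch, features)
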