2022
DOI: 10.48550/arxiv.2202.07896
Preprint

Aryl: An Elastic Cluster Scheduler for Deep Learning

Abstract: Companies build separate training and inference GPU clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both training and inference: inference clusters have low GPU utilization when the traffic load is low; training jobs often experience long queueing time due to lack of resources. We introduce Aryl, a new cluster scheduler to address these problems. Aryl introduces capacity loaning to loan idle inference GPU servers for training jobs. It further exploits elastic …
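To make the capacity-loaning idea in the abstract concrete, here is a minimal sketch in Python. It is not Aryl's implementation; the class names, the reserve_fraction threshold, and the rebalance policy are all invented for illustration. The idea it captures: idle inference servers are loaned to training when traffic is low, and reclaimed (which would preempt the borrowed training jobs) when traffic rises.

# Hypothetical sketch of capacity loaning; not Aryl's actual code.
from dataclasses import dataclass
from math import ceil

@dataclass
class Server:
    name: str
    loaned_to_training: bool = False

def servers_needed(traffic_load: float, total: int, reserve: float) -> int:
    """Servers the inference side must keep: current demand plus a reserve."""
    return min(total, ceil(traffic_load * total) + ceil(reserve * total))

def rebalance(servers: list, traffic_load: float, reserve: float = 0.2) -> None:
    """Loan idle inference servers to training, or reclaim them."""
    needed = servers_needed(traffic_load, len(servers), reserve)
    kept = [s for s in servers if not s.loaned_to_training]
    if len(kept) > needed:
        # Low traffic: loan the surplus to the training scheduler.
        for s in kept[needed:]:
            s.loaned_to_training = True
    elif len(kept) < needed:
        # Traffic spike: reclaim loaned servers; borrowed training jobs
        # running on them would be preempted or checkpointed first.
        loaned = [s for s in servers if s.loaned_to_training]
        for s in loaned[:needed - len(kept)]:
            s.loaned_to_training = False

cluster = [Server(f"gpu-{i}") for i in range(10)]
rebalance(cluster, traffic_load=0.3)   # low traffic: 5 servers get loaned
rebalance(cluster, traffic_load=0.9)   # spike: all loans are reclaimed

The real scheduler would also have to decide which training jobs to preempt and how to drain servers; this sketch only captures the loan/reclaim decision.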

Cited by 1 publication (3 citation statements)
References 22 publications

“…Several recent works have focused on fault-tolerant data-parallel training via dynamically changing the global batch size [16,21,34,52]. Fault-tolerant hybrid-parallel training is more challenging because the model is distributed across multiple GPUs.…”
Section: Fault Tolerance in Distributed Training
confidence: 99%
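As a hedged sketch of the mechanism this statement alludes to (invented for illustration, not taken from any of the cited systems): when a data-parallel worker fails, the survivors can either keep their per-worker batch size, letting the global batch size shrink, or enlarge their per-worker share to preserve it, the latter often paired with the linear learning-rate scaling heuristic.

# Illustrative only: batch-size adjustment after a data-parallel worker failure.
def rescale(global_batch: int, old_workers: int, new_workers: int,
            keep_global: bool) -> tuple[int, int]:
    """Return (per_worker_batch, effective_global_batch) after a failure."""
    per_worker = global_batch // old_workers
    if keep_global:
        # Preserve the global batch: each survivor takes a larger share
        # (ceil division), at the cost of higher per-GPU memory use.
        per_worker = -(-global_batch // new_workers)
    return per_worker, per_worker * new_workers

def scaled_lr(base_lr: float, base_global: int, new_global: int) -> float:
    # Linear LR scaling heuristic: learning rate proportional to global batch.
    return base_lr * new_global / base_global

# 512 global batch on 8 workers; one worker dies.
print(rescale(512, 8, 7, keep_global=True))    # (74, 518): global roughly kept
print(rescale(512, 8, 7, keep_global=False))   # (64, 448): global shrinks
print(scaled_lr(0.1, 512, 448))                # LR scaled down with the batch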
“…CoDDL [16] balances resource efficiency and short job priority in elastic resource sharing problems. Aryl [21] enables elastic resource sharing between inference and training workloads. Pollux [35] considers both resource utilization and statistical efficiency of training jobs when adaptively allocating resources.…”
Section: Related Work
confidence: 99%