2019
DOI: 10.48550/arxiv.1901.05758
Preprint

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Abstract: With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways fr…

Cited by 16 publications (3 citation statements)
References 19 publications
“…Compared to the general-purpose scheduling policies Dominant Resource Fairness [49] and Tetris [52], Optimus shows significant improvements in average job completion time and makespan. Jeon et al. [78] analyze log traces from a large-scale DL cluster system. In particular, they analyze the trade-off between locality constraints and queuing delays for large training jobs that occupy a lot of (GPU) resources.…”
Section: Multi-tenant
confidence: 99%
“…Workload: We generate a workload similar to the Microsoft job trace [33]. More details about the Microsoft trace can be found in [33] and in the Appendix of [17]. We generate a total of 160 DDL jobs by scaling down the original job trace.…”
Section: Experimental Setup
confidence: 99%
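The trace scale-down mentioned in the quote above can be sketched as a simple downsampling step. This is a minimal illustration, not the actual procedure from [17] or [33]; the record fields (`submit_time`, `num_gpus`) and the random-sampling strategy are assumptions, not the Microsoft trace schema:

```python
import random

def scale_down_trace(trace, target_jobs=160, seed=0):
    """Downsample a job trace to a fixed number of jobs.

    `trace` is a list of job records; the dict keys used here
    (`submit_time`, `num_gpus`) are illustrative placeholders,
    not field names from the real Microsoft trace.
    """
    rng = random.Random(seed)
    sample = rng.sample(trace, min(target_jobs, len(trace)))
    # Re-sort by submission time so the sampled jobs keep the
    # original arrival order (and hence inter-arrival patterns).
    sample.sort(key=lambda job: job["submit_time"])
    return sample

# Example with a synthetic 1000-job trace.
trace = [{"submit_time": t, "num_gpus": g}
         for t, g in zip(range(1000), [1, 2, 4, 8] * 250)]
jobs = scale_down_trace(trace, target_jobs=160)
print(len(jobs))  # 160
```

Other scale-down strategies (e.g. taking a contiguous time window, or stratified sampling by GPU demand) are equally plausible readings of "scaling down the original job trace."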
“…However, commodity hardware still proves to be useful for GPU clusters [7,8,9], especially in academic settings where the NVIDIA EULA does not seem to apply. Several studies have examined the performance of commodity [10,11] and non-commodity [12] GPU hardware for various calculations, and have generally found commodity hardware to be suitable for use in GPU clusters. Although NVIDIA's legal definitions in their EULA are intentionally vague, it seems that using commodity NVIDIA GPUs and the associated NVIDIA drivers/software is allowed for smaller academic uses such as our use-case [13]. Although some guidelines exist for GPU clusters [15], and openHPC has "recipes" (instructions for installing SLURM on a CentOS or SUSE cluster), there is no good step-by-step documentation for creating a commodity GPU cluster from scratch using Ubuntu Linux.…”
confidence: 99%