2019
DOI: 10.48550/arxiv.1901.05758
Preprint

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Abstract: With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways fr…

Cited by 16 publications (3 citation statements)
References 19 publications
“…Compared to the general-purpose scheduling policies Dominant Resource Fairness [49] and Tetris [52], Optimus shows significant improvements in average job completion time and makespan. Jeon et al. [78] analyze log traces from a large-scale DL cluster system. In particular, they analyze the trade-off between locality constraints and queuing delays for large training jobs that occupy a lot of (GPU) resources.…”
Section: Multi-tenant
confidence: 99%
“…Workload: We generate a workload similar to the Microsoft job trace [33]. More details about the Microsoft trace can be found in [33] and in the Appendix of [17]. We generate a total of 160 DDL jobs by scaling down the original job trace.…”
Section: Experimental Setup
confidence: 99%
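The trace scale-down mentioned in the quote above can be sketched as a simple downsampling step. This is a minimal illustration, not the actual procedure from [17] or [33]; the record fields (`submit_time`, `num_gpus`) and the random-sampling strategy are assumptions, not the Microsoft trace schema:

```python
import random

def scale_down_trace(trace, target_jobs=160, seed=0):
    """Downsample a job trace to a fixed number of jobs.

    `trace` is a list of job records; the dict keys used here
    (`submit_time`, `num_gpus`) are illustrative placeholders,
    not field names from the real Microsoft trace.
    """
    rng = random.Random(seed)
    sample = rng.sample(trace, min(target_jobs, len(trace)))
    # Re-sort by submission time so the sampled jobs keep the
    # original arrival order (and hence inter-arrival patterns).
    sample.sort(key=lambda job: job["submit_time"])
    return sample

# Example with a synthetic 1000-job trace.
trace = [{"submit_time": t, "num_gpus": g}
         for t, g in zip(range(1000), [1, 2, 4, 8] * 250)]
jobs = scale_down_trace(trace, target_jobs=160)
print(len(jobs))  # 160
```

Other scale-down strategies (e.g. taking a contiguous time window, or stratified sampling by GPU demand) are equally plausible readings of "scaling down the original job trace."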
“…However, commodity hardware still proves to be useful for GPU clusters [7,8,9], especially in academic settings where the NVIDIA EULA does not seem to apply. Several studies have examined the performance of commodity [10,11] and non-commodity [12] GPU hardware for various calculations, and have generally found commodity hardware to be suitable for use in GPU clusters. Although NVIDIA's legal definitions in their EULA are intentionally vague, it seems that using commodity NVIDIA GPUs and the associated NVIDIA drivers/software is allowed for smaller academic uses such as our use-case [13]. Although some guidelines exist for GPU clusters [15], and openHPC has "recipes" (instructions for installing SLURM on a CentOS or SUSE cluster), there is no good step-by-step documentation for creating a commodity GPU cluster from scratch using Ubuntu Linux.…”
confidence: 99%