Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126933
Topology-aware GPU scheduling for learning workloads in cloud environments

Abstract: Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments…
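A minimal sketch of what topology-aware placement can look like in practice is shown below. The hop-count matrix, the 4-GPU node size, and the function names are illustrative assumptions, not the paper's actual policy; the point is only that candidate GPU sets are scored by their pairwise interconnect distance.

# Hypothetical topology-aware placement: choose the free GPUs whose
# pairwise "distance" in a device-interconnect matrix is smallest, so a
# communication-heavy job lands on NVLink-connected GPUs rather than
# GPUs reached across the PCIe/CPU interconnect path.
from itertools import combinations

# Assumed distances for a 4-GPU node: 1 = NVLink neighbour,
# 2 = same socket via PCIe, 3 = across the CPU interconnect.
HOPS = [
    [0, 1, 2, 3],
    [1, 0, 3, 2],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
]

def placement_cost(gpus):
    # Sum of pairwise distances for a candidate GPU set.
    return sum(HOPS[a][b] for a, b in combinations(gpus, 2))

def best_placement(free_gpus, gpus_needed):
    # Return the free-GPU subset with the lowest communication cost.
    return min(combinations(free_gpus, gpus_needed), key=placement_cost)

print(best_placement(free_gpus=[0, 1, 2, 3], gpus_needed=2))  # -> (0, 1)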

Cited by 48 publications (35 citation statements). References 28 publications.
“…As regards DL training, Xiao et al. [26] propose Gandiva, a scheduling framework that improves latency when training DL models on a GPU cluster by exploiting the heterogeneity and recurrent behavior of DL jobs as they run mini-batch iterations. Finally, a seminal work on scheduling multiple GPUs among competing jobs on high-end servers is [27]. That paper proposes a topology-aware scheduling policy for DL jobs in cloud environments: a placement strategy that schedules jobs on an NVLink-based Power8 machine so as to satisfy workload requirements while also preventing application interference.…”
Section: Related Work
confidence: 99%
“…As in other literature proposals, the job inter-arrival times have been generated according to Poisson distributions [27]. Two classes of instances have been generated, which differ in the mean of the distribution: 30s for the first class and 45s for the second class.…”
Section: Experimental Setup
confidence: 99%
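The arrival model quoted above can be reproduced in a few lines: a Poisson arrival process has exponentially distributed inter-arrival times, so drawing exponential gaps with means of 30s and 45s yields the two workload classes. This is an illustrative reconstruction, not the authors' generator; the seed and job counts are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=42)

def arrival_times(mean_interarrival_s, n_jobs):
    # Cumulative submission times of n_jobs under a Poisson process.
    gaps = rng.exponential(scale=mean_interarrival_s, size=n_jobs)
    return np.cumsum(gaps)

class_1 = arrival_times(mean_interarrival_s=30.0, n_jobs=100)  # first class
class_2 = arrival_times(mean_interarrival_s=45.0, n_jobs=100)  # second class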
“…Many of them are interrelated, such as L1/L2/L3 memory cache misses, but they are still relevant for many performance-driven decisions. An example of this importance is the emerging effort to develop novel NUMA-aware [23] and GPU-topology-aware [24] placement strategies for big-data and deep-learning workloads, because the topology of modern processors is extremely complex and can significantly impact application performance. Furthermore, works like [25] (which use the same features as we do in our experiments) apply dimensionality reduction in order to facilitate parameter estimation.…”
Section: Motivation
confidence: 99%
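The dimensionality-reduction step mentioned in that excerpt can be illustrated with a standard PCA pass over correlated counter features; the feature matrix and component count below are assumptions made for illustration, not details taken from [25].

import numpy as np
from sklearn.decomposition import PCA

# rows = profiled runs, columns = raw hardware counters
# (L1/L2/L3 cache misses, instructions retired, ...); synthetic data here.
counters = np.random.default_rng(0).random((200, 12))

pca = PCA(n_components=3)              # keep the three strongest directions
reduced = pca.fit_transform(counters)  # 200 x 3 matrix for model fitting
print(pca.explained_variance_ratio_)   # variance retained per component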
“…One cause of such underutilization is that DL resource managers disallow co-location of multiple DL jobs within the same GPU [25,13]; a characteristic shared with other resource managers, such as Kubernetes and YARN, that were originally designed for CPU-based workloads [16,35]. Instead, most DL resource managers focus on reducing network latency and improving locality [3,25,13]. This inability to co-locate DL jobs within the same GPU results in reduced resource utilization, longer queuing times, and reduced cost efficiency in DL systems.…”
Section: Introduction
confidence: 99%
“…Recent DL resource managers have been proposed that make placement decisions by consolidating DL jobs onto fewer machines to minimize workload and job completion time (JCT) [3,37,13]. However, while DL resource managers now exist that allow co-location [37,29], little attention has been drawn to the performance interference that arises between multiple DL jobs training within the same GPU.…”
Section: Introduction
confidence: 99%
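The consolidation strategy these excerpts describe can be sketched as a tightest-fit placement rule: put each job on the node that still fits it but has the fewest free GPUs left, so work packs onto fewer machines. The data structures and tie-breaking rule below are illustrative, not any specific resource manager's algorithm.

def consolidate(job_gpus, nodes):
    # nodes: {name: {"total": int, "used": int}}; returns the chosen node
    # name, or None if the job must queue.
    feasible = [
        (name, state) for name, state in nodes.items()
        if state["total"] - state["used"] >= job_gpus
    ]
    if not feasible:
        return None
    # Tightest fit: prefer the node with the fewest free GPUs remaining.
    name, state = min(feasible, key=lambda ns: ns[1]["total"] - ns[1]["used"])
    state["used"] += job_gpus
    return name

cluster = {"node-a": {"total": 4, "used": 2}, "node-b": {"total": 4, "used": 0}}
print(consolidate(2, cluster))  # -> "node-a": packs onto the busier node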