Abstract: To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference, causing slowdown. In this paper we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively pre…
“…used and the best available configuration (VM type) to be assigned to each selected node is solved. As in other literature proposals [14], [17]-[19], we assume that multiple jobs can be deployed on the same node, while, within each node, each job receives a certain number of GPUs for exclusive use. As will be observed in Section 4.6, the interference experienced among jobs in the same VM is negligible in our setting.…”
Section: System Architecture and Problem Statement
“…Moreover, they propose a dynamic programming-based heuristic algorithm to determine an effective resource allocation, while jobs are scheduled relying on a FIFO mechanism. Finally, an interference-aware and prediction-based resource manager is proposed in [19], where GPU utilization is identified as a proxy metric that allows good placement decisions to be determined.…”
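The dynamic-programming heuristic is only mentioned in passing in this snippet; as an illustration of the general DP structure (not the cited paper's actual algorithm), the toy sketch below splits a node's GPU budget among FIFO-queued jobs to maximize aggregate speedup. The speedup table and job scaling profiles are invented.

```python
# Minimal sketch (hypothetical data): dynamic programming that splits a
# node's GPU budget among FIFO-queued jobs to maximize aggregate speedup.
# speedup[j][g] = throughput gain of job j when given g GPUs (g = 0..G).

def dp_allocate(speedup, total_gpus):
    """Return (best_value, allocation) where allocation[j] is job j's GPU count."""
    n = len(speedup)
    # best[j][g] = max aggregate speedup using jobs j..n-1 with g GPUs left
    best = [[0.0] * (total_gpus + 1) for _ in range(n + 1)]
    choice = [[0] * (total_gpus + 1) for _ in range(n)]
    for j in range(n - 1, -1, -1):
        for g in range(total_gpus + 1):
            for k in range(g + 1):  # give k GPUs to job j
                val = speedup[j][k] + best[j + 1][g - k]
                if val > best[j][g]:
                    best[j][g] = val
                    choice[j][g] = k
    # Recover the allocation by replaying the stored choices.
    alloc, g = [], total_gpus
    for j in range(n):
        alloc.append(choice[j][g])
        g -= choice[j][g]
    return best[0][total_gpus], alloc

# Example: two jobs on a 4-GPU node; job 0 scales better than job 1.
speedup = [
    [0.0, 1.0, 1.9, 2.7, 3.4],  # job 0
    [0.0, 1.0, 1.5, 1.8, 2.0],  # job 1
]
print(dp_allocate(speedup, 4))  # -> (3.7, [3, 1])
```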
The Deep Learning (DL) paradigm has gained remarkable popularity in recent years. DL models are used to tackle increasingly complex problems, making the training process require considerable computational power. The parallel computing capabilities offered by modern GPUs partially fulfill this need, but the high costs of GPU-as-a-Service solutions in the cloud call for efficient capacity planning and job scheduling algorithms that reduce operational costs via resource sharing. In this work, we jointly address the online capacity planning and job scheduling problems from the perspective of cloud end-users. We present a Mixed Integer Linear Programming (MILP) formulation and a path relinking-based method aiming at optimizing operational costs by (i) rightsizing Virtual Machine (VM) capacity at each node, (ii) partitioning the set of GPUs among multiple concurrent jobs on the same VM, and (iii) determining a due-date-aware job schedule. An extensive experimental campaign attests to the effectiveness of the proposed approach in practical scenarios: cost savings of up to 97% are attained compared with first-principle methods based on, e.g., Earliest Deadline First; cost reductions of up to 20% are obtained with respect to a previously proposed Hierarchical Method, and of up to 95% against a dynamic programming-based method from the literature. Scalability analyses show that systems with up to 100 nodes and 450 concurrent jobs can be managed in less than 7 seconds. Validation in a prototype cloud environment shows a deviation below 5% between real and predicted costs.
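To make points (i) and (ii) of the abstract concrete, here is a minimal, hedged sketch of the kind of formulation involved: a toy MILP (written with the PuLP modeling library) that picks one VM type for a node and packs jobs into its exclusive GPU partition. The VM types, prices, GPU demands, and per-job reward are all invented; the paper's actual MILP additionally models due dates and scheduling over time.

```python
# Illustrative sketch only: a toy MILP in the spirit of the joint capacity
# planning / job placement problem. VM types, prices and GPU demands are
# invented; the real formulation also covers due-date-aware scheduling.
import pulp

vm_types = {"small": (2, 1.0), "large": (8, 3.2)}   # name: (GPUs, $/h)
jobs = {"j1": 2, "j2": 4, "j3": 1}                   # job: GPUs required

prob = pulp.LpProblem("rightsize_one_node", pulp.LpMinimize)
use = pulp.LpVariable.dicts("use", list(vm_types), cat="Binary")    # VM choice
place = pulp.LpVariable.dicts("place", list(jobs), cat="Binary")    # job on node

# Cost of the chosen VM minus a (made-up) reward for each job served.
prob += (pulp.lpSum(use[v] * vm_types[v][1] for v in vm_types)
         - pulp.lpSum(10 * place[j] for j in jobs))

prob += pulp.lpSum(use[v] for v in vm_types) <= 1                   # one VM type
# Exclusive GPU partitioning: placed jobs' GPUs must fit the VM's capacity.
prob += (pulp.lpSum(jobs[j] * place[j] for j in jobs)
         <= pulp.lpSum(vm_types[v][0] * use[v] for v in vm_types))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status])
print({v: use[v].value() for v in vm_types},
      {j: place[j].value() for j in jobs})
```

On this toy instance the solver selects the large VM and places all three jobs (7 of 8 GPUs used), since the placement reward outweighs the price gap between VM types.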
“…As F is microservice-specific, each key component of DLRA will be profiled by the DLRA Master. We pre-train the prediction model in an offline training stage, similarly to existing approaches [36], [18], [37], based on a set of workload benchmarking and profiling, but will update the model parameters periodically according to the on-the-fly resource usage.…”
Section: QoS Prediction Engine
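The pre-train-offline / refresh-online pattern described in the snippet above can be sketched with any incremental learner. The sketch below assumes scikit-learn's SGDRegressor and synthetic resource-usage features; these are placeholders, not the paper's actual QoS prediction engine.

```python
# Sketch of the pre-train-offline / refresh-online pattern (assumed model:
# scikit-learn's SGDRegressor; features and targets here are synthetic).
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = np.array([2.0, 1.0, 0.5, 3.0])  # synthetic usage-to-latency mapping

# Offline stage: fit on profiled benchmarks (resource usage -> latency).
X_profile = rng.random((500, 4))         # e.g. cpu, mem, net, req_rate
y_profile = X_profile @ true_w + 0.1
model = SGDRegressor(max_iter=1000, tol=1e-3).fit(X_profile, y_profile)

# Online stage: periodically update parameters from on-the-fly usage,
# without retraining from scratch.
for _ in range(10):                      # one iteration per monitoring window
    X_live = rng.random((32, 4))
    y_live = X_live @ true_w + 0.1
    model.partial_fit(X_live, y_live)    # incremental parameter refresh

print(model.predict(rng.random((1, 4))))
```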
“…The ability to co-locate jobs (i.e., execute within the same CPU or GPU) has been identified as a means to address the under-utilization problem. Understanding and achieving high resource utilization or high energy efficiency for heterogeneous workloads in cloud computing is an important topic [44], [57], [58], [27], [37]. Existing work on QoS management when co-locating heterogeneous workloads falls into two distinct categories: (i) reducing the probability of resource contention by either granting isolated execution environments to LRAs [49], [59] or adjusting task placement to reduce the resource contention on a certain node [60], [11], primarily for the runtime QoS of LRAs.…”
To achieve a high degree of resource utilization, production clusters need to co-schedule diverse workloads, including both batch analytic jobs with short-lived tasks and long-running applications (LRAs) that execute for time frames ranging from hours to months, onto shared resources. The microservice architecture has advanced the emergence of distributed LRAs (DLRAs), comprising multiple interconnected microservices that execute in long-lived distributed containers and serve massive user requests. Detecting and mitigating QoS violations becomes even more intractable due to network uncertainties and latency propagation across dependent microservices. However, current resource managers are only responsible for resource allocation among applications/jobs and are agnostic to runtime QoS, such as application-level latency. State-of-the-art QoS-aware scheduling approaches are dedicated to monolithic applications and do not consider the spatio-temporal performance variability across distributed microservices. In this paper, we present TOPOSCH, a new scheduling and execution framework that prioritizes the QoS of DLRAs while balancing the performance of batch jobs and maintaining high cluster utilization by harvesting idle resources. TOPOSCH tracks the footprint of every single request across microservices and uses critical path analysis, based on the end-to-end latency graph, to identify microservices at high risk of QoS violation. Based on microservice- and node-level risk assessment, it intervenes in batch scheduling by adaptively reducing the resources visible to batch tasks, thereby delaying their execution to give way to DLRAs. We propose a prediction-based vertical resource auto-scaling mechanism, aided by resource-performance modeling and fine-grained resource inference and access control, for prompt recovery from QoS violations. Cost-effective task preemption is leveraged to ensure low-cost preemption and resource reclamation during auto-scaling. TOPOSCH is integrated with Apache YARN, and experiments show that it outperforms other baselines in guaranteeing the performance of DLRAs, at an acceptable cost of batch job slowdown. On average, the tail latency of DLRAs in TOPOSCH is merely 1.12x that of executing alone, with a 26% JCT increase for Spark analytic jobs.
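A minimal sketch of the critical-path idea TOPOSCH builds on: compute the longest-latency path through a per-request microservice DAG and flag the services on it as the QoS-violation risks. The service graph and latencies below are invented, and this omits TOPOSCH's risk assessment and scheduling intervention.

```python
# Sketch: critical path of a per-request latency DAG (invented topology).
# Nodes are microservices; edge u -> v means v is called downstream of u.
from functools import lru_cache

latency = {"gw": 5, "auth": 12, "cart": 30, "db": 45, "render": 20}  # ms
downstream = {"gw": ["auth", "cart"], "auth": ["db"], "cart": ["db"],
              "db": ["render"], "render": []}

@lru_cache(maxsize=None)
def longest_from(svc):
    """Return (total latency, path) of the worst path starting at `svc`."""
    best_tail, best_path = 0, []
    for nxt in downstream[svc]:
        tail, path = longest_from(nxt)
        if tail > best_tail:
            best_tail, best_path = tail, path
    return latency[svc] + best_tail, [svc] + best_path

total, path = longest_from("gw")
print(total, path)  # services on this path carry the QoS-violation risk
```

On this toy graph the critical path is gw -> cart -> db -> render (100 ms); a scheduler in TOPOSCH's spirit would throttle batch tasks on the nodes hosting those services first.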
“…Alternatively, some works use data-driven approaches to make the GPU sharing decision. Horus [159,160] designs a prediction-based interference-aware mechanism that can be integrated with existing DL training scheduling frameworks. The prediction engine in Horus is in charge of estimating the GPU usage of each DL job by accessing its graph and dry-running the model upon job submission.…”
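The dry-run idea attributed to Horus here can be sketched as follows: time a few forward/backward passes of the submitted model, map the measurement to a utilization estimate, and score candidate GPUs by spare capacity. The model, the mapping rule, and the placement score below are placeholders (Horus trains a real predictor over the job's computation graph), so treat this as the shape of the approach, not its actual engine.

```python
# Sketch only: dry-run a submitted model for a few steps at submission time,
# map the measurement to a GPU-utilization estimate via a placeholder rule,
# and pick the least-loaded GPU that still has room. Runs on CPU as written.
import time
import torch

def dry_run_step_time(model, sample, steps=5):
    """Average wall-clock time of a forward/backward/update step."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        model(sample).sum().backward()
        opt.step()
    return (time.perf_counter() - start) / steps

def estimate_utilization(step_time, n_params):
    """Placeholder mapping from dry-run stats to a utilization estimate."""
    return min(1.0, 0.2 + 1e-8 * n_params + 0.5 * step_time)

def pick_gpu(job_util, gpu_loads, cap=1.0):
    """Least-loaded GPU whose predicted combined load stays under `cap`."""
    ok = {g: u for g, u in gpu_loads.items() if u + job_util <= cap}
    return min(ok, key=ok.get) if ok else None

model = torch.nn.Linear(256, 256)            # stand-in for a submitted job
t = dry_run_step_time(model, torch.randn(64, 256))
util = estimate_utilization(t, sum(p.numel() for p in model.parameters()))
print(util, pick_gpu(util, {"gpu0": 0.7, "gpu1": 0.3, "gpu2": 0.5}))
```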
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure, so dedicated GPU accelerators have been collectively constructed into GPU datacenters. An efficient scheduler design for such a GPU datacenter is crucially important to reduce operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads cannot enable DL workloads to fully utilize GPU resources. Recently, substantial schedulers have been proposed, tailored for DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of scheduling objectives and resource consumption features. Finally, we outline several promising future research directions. A more detailed summary, with the surveyed papers and code links, can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers. CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Machine learning; • Computer systems organization → Cloud computing.