2020
DOI: 10.1007/978-3-030-60239-0_33

Horus: An Interference-Aware Resource Manager for Deep Learning Systems

Abstract: Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems, ranging from a single GPU device to machine clusters, require state-of-the-art resource management to increase resource utilization and job throughput. While co-location, in which multiple jobs share the same GPU, has been identified as an effective means to achieve this, such co-location incurs performance interference that directly degrades DL training and inference performance. Existing approaches t…
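To make the abstract's central idea concrete, here is a minimal sketch of interference-aware placement: predict the slowdown a new job would suffer on each candidate GPU and pick the least-contended one. This is not Horus's actual algorithm; the Job/GPU structures and the predict_slowdown model below are hypothetical stand-ins for a learned interference predictor.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    util: float  # fraction of a GPU's compute used in isolation (hypothetical metric)

@dataclass
class GPU:
    gpu_id: int
    jobs: list = field(default_factory=list)

def predict_slowdown(new_job, resident):
    """Hypothetical interference model: slowdown grows with the total
    utilization of co-located jobs. A real manager would learn this
    from profiled co-location measurements."""
    total = new_job.util + sum(j.util for j in resident)
    return max(1.0, total)  # 1.0 means no predicted slowdown

def place(job, gpus):
    """Place the job on the GPU with the lowest predicted interference."""
    best = min(gpus, key=lambda g: predict_slowdown(job, g.jobs))
    best.jobs.append(job)
    return best

gpus = [GPU(0), GPU(1)]
place(Job("resnet50-train", 0.7), gpus)
chosen = place(Job("bert-infer", 0.4), gpus)
print(f"bert-infer placed on GPU {chosen.gpu_id}")  # GPU 1: less contended
```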

Cited by 6 publications (5 citation statements); references 26 publications.
“…However, these studies highlight that concurrent jobs can potentially interfere with each other, adversely affecting training performance. Furthermore, the extent of interference depends on the DL models themselves [19,20]. In the pursuit of identifying suitable job combinations, Gandiva employs a trial-and-error approach, while Gavel establishes a threshold for the difference between isolated training and packing decisions.…”
Section: Literature Review
confidence: 99%
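The threshold rule attributed to Gavel in the statement above can be illustrated with a short sketch: accept a packing only if every job's packed throughput stays within a tolerance of its isolated throughput. The 10% threshold and the throughput figures are hypothetical, not values from the cited papers.

```python
THRESHOLD = 0.10  # tolerate at most a 10% throughput drop versus isolation

def should_pack(isolated, packed):
    """Both dicts map job name -> measured throughput (samples/s)."""
    for job, iso_tput in isolated.items():
        degradation = (iso_tput - packed[job]) / iso_tput
        if degradation > THRESHOLD:
            return False
    return True

isolated = {"resnet50": 410.0, "lstm": 220.0}
packed = {"resnet50": 385.0, "lstm": 160.0}  # lstm loses ~27% to interference
print(should_pack(isolated, packed))  # False: keep the jobs on separate GPUs
```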
“…Inference Serving Systems Most modern model serving systems (e.g., Clipper [9], Amazon Sagemaker, Microsoft AzureML, INFaaS [36], Horus [40], Perseus [29]) treat ML inference as a black box. These approaches must train and manage many models to meet diverse SLOs under varying query loads.…”
Section: Related Work
confidence: 99%
“…Suitable model-variants today may fail to satisfy SLOs in the future when combined with new compute infrastructure or deployed in a new execution environment [3]. Co-location interference: Fourth, inference models are typically co-located on worker machines to improve resource utilization and reduce operating costs [29,33,36,40]. Unfortunately, model co-location introduces the opportunity for model interference, which can degrade inference latency and cause SLO violations.…”
Section: Introduction
confidence: 99%
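The co-location interference risk described above is often handled with admission control: admit a model onto a worker only if its predicted tail latency under interference still meets the SLO. The following is a minimal sketch; the fixed per-neighbor latency inflation and the numbers are hypothetical placeholders for a profiled or learned interference model.

```python
def admits(base_p99_ms, n_neighbors, slo_ms, interference_per_neighbor=0.15):
    """Admit a model only if predicted p99 latency under co-location
    meets the SLO; each neighbor is assumed (hypothetically) to inflate
    latency by a fixed fraction."""
    predicted_p99 = base_p99_ms * (1 + interference_per_neighbor * n_neighbors)
    return predicted_p99 <= slo_ms

print(admits(base_p99_ms=40.0, n_neighbors=3, slo_ms=100.0))  # True  (58 ms <= 100 ms)
print(admits(base_p99_ms=40.0, n_neighbors=3, slo_ms=50.0))   # False (58 ms > 50 ms)
```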
“…Concurrent Execution of Co-Located DL Workloads: The concurrent execution of co-located DL workloads leads to workload interference through resource contention, bandwidth bottleneck, race conditions, etc. [36]. Different workloads can be scheduled on dedicated GPUs to provide isolation to training processes.…”
Section: Introduction
confidence: 99%
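The dedicated-GPU isolation mentioned in the last statement is commonly implemented by restricting each process's device visibility with CUDA_VISIBLE_DEVICES, which CUDA honors when enumerating devices. A brief sketch follows; the training scripts are hypothetical placeholders.

```python
import os
import subprocess

# Hypothetical training scripts, each pinned to its own dedicated GPU.
jobs = [("train_resnet.py", "0"), ("train_bert.py", "1")]

procs = []
for script, gpu in jobs:
    # Each process enumerates only the device listed here, so the two
    # jobs cannot contend for the same GPU (at the cost of utilization).
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpu}
    procs.append(subprocess.Popen(["python", script], env=env))

for p in procs:
    p.wait()
```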