2022 IEEE 28th Real-Time and Embedded Technology and Applications Symposium (RTAS)
DOI: 10.1109/rtas54340.2022.00020

FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees

Abstract: Deep learning (DL) inference has become an essential building block in modern intelligent applications. Due to the high computational intensity of DL, it is critical to scale DL inference serving systems in response to fluctuating workloads to achieve resource efficiency. Meanwhile, intelligent applications often require strict service level agreements (SLAs), which need to be guaranteed when the system is scaled. The problem is complex and has been tackled only in simple scenarios so far. This paper describes …
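
The abstract is truncated by the source page, but the core problem it states — choosing how many replicas to run as the request rate fluctuates, while keeping latency within the SLA — can be illustrated with a minimal sketch. This is a hypothetical single-stage example, not the paper's algorithm; `per_replica_qps`, `headroom`, and the profile numbers are assumptions.

```python
import math

def replicas_needed(arrival_qps: float,
                    per_replica_qps: float,
                    headroom: float = 0.8) -> int:
    """Smallest replica count that keeps per-replica load below
    `headroom` * profiled capacity, so queueing delay stays small
    enough to meet the latency SLA (hypothetical sketch)."""
    if arrival_qps <= 0:
        return 1  # keep one warm replica for availability
    usable_qps = per_replica_qps * headroom
    return max(1, math.ceil(arrival_qps / usable_qps))

# Scaling decisions as the workload fluctuates:
for qps in (50, 400, 1200):
    print(f"{qps} req/s -> {replicas_needed(qps, per_replica_qps=150)} replicas")
```

FA2 itself targets the harder case noted in the citation statements below, where such a decision must be made jointly across a whole DNN pipeline rather than for one model in isolation.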

Cited by 13 publications (5 citation statements) · References 46 publications

“…Moreover, high accuracy is crucial for these services [20,26]. Consequently, inference systems must deliver highly accurate predictions with fewer computing resources (cost-efficient) while meeting latency constraints under workload variations [18,19,29,37].…”
Section: Feature
confidence: 99%
“…Towards this direction, several works with different approaches have been conducted, aiming to address the problem of efficient resource orchestration and optimization for ML inference serving systems. Adaptive and pre-defined batching techniques [30]–[33] have been introduced to support ML inference, while auto-scaling approaches are also considered [32], [34]–[36]. Furthermore, serverless approaches have been explored for efficient ML inference [33], [37], while ML-based and predictive solutions for request load and resource utilization have been widely utilized [13], [14], [38]–[44], such as reinforcement learning-based solutions [43].…”
Section: Related Work
confidence: 99%
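The adaptive batching idea cited above can be sketched generically: accumulate requests into a batch only for as long as the oldest queued request can still be served within its latency budget. This is not any one cited system's implementation; the queue layout and the profiled `infer_s` figure are assumptions.

```python
import time
from queue import Queue, Empty

def form_batch(q: Queue, max_batch: int, budget_s: float, infer_s: float):
    """Adaptive batching sketch: grow the batch while the oldest
    request's remaining latency budget still covers one model run.

    Each queue item is (time.monotonic() at enqueue, request);
    `infer_s` is an assumed, profiled per-batch inference time."""
    first_ts, first_req = q.get()  # block until at least one request
    batch = [first_req]
    deadline = first_ts + budget_s - infer_s
    while len(batch) < max_batch and time.monotonic() < deadline:
        try:
            _, req = q.get(timeout=max(0.0, deadline - time.monotonic()))
            batch.append(req)
        except Empty:
            break
    return batch
```

Pre-defined batching, by contrast, would simply wait for a fixed batch size (or a fixed timeout) regardless of each request's remaining budget.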
“…Figure 9 indicates the QoS violations (top) and average CPU utilization (middle) for the image classification task. For the low (20) and medium (35) target QoS constraints, the model-less scheduler mostly uses the ONNX-Mobilenet benchmark, which never violates the constraint. With the high QoS constraint (50), the TF-Mobilenet benchmark is used most of the time, as it has a higher QPS capability with a lower CPU utilization.…”
Section: B. Model-less Inference Engine Scheduler Evaluation
confidence: 99%
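The selection logic this excerpt describes — choose, among profiled model variants, one whose measured throughput meets the target QoS at the lowest resource cost — can be sketched as follows. The profile numbers are invented placeholders chosen only so the sketch reproduces the selection pattern reported above; they are not measurements from the cited paper.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    qps_capacity: float  # profiled sustainable throughput
    cpu_util: float      # profiled CPU utilization at that throughput

def select_variant(variants, target_qps):
    """Model-less selection sketch: among variants that can sustain
    the target QPS, prefer the lowest CPU utilization."""
    feasible = [v for v in variants if v.qps_capacity >= target_qps]
    if not feasible:
        # No variant meets the target: fall back to the fastest one.
        return max(variants, key=lambda v: v.qps_capacity)
    return min(feasible, key=lambda v: v.cpu_util)

# Placeholder profiles (not the paper's measurements):
catalog = [Variant("ONNX-Mobilenet", qps_capacity=40, cpu_util=0.30),
           Variant("TF-Mobilenet",   qps_capacity=60, cpu_util=0.45)]
for target in (20, 35, 50):
    print(f"target {target} QPS -> {select_variant(catalog, target).name}")
```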
“…Clockwork [11] leverages the predictable performance of DNNs, considers the SLO guarantees on the server, and maps requests to the desired model, but does not utilize DNN adaptation. InferLine [44], Llama [45], and FA2 [46] optimize the serving of complex DNN pipelines. INFaaS [12] automates hardware and model-variant selection and deployment through managed services.…”
Section: Related Work
confidence: 99%