2022 IEEE 28th Real-Time and Embedded Technology and Applications Symposium (RTAS)
DOI: 10.1109/rtas54340.2022.00020

FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees

Abstract: Deep learning (DL) inference has become an essential building block in modern intelligent applications. Due to the high computational intensity of DL, it is critical to scale DL inference serving systems in response to fluctuating workloads to achieve resource efficiency. Meanwhile, intelligent applications often require strict service level agreements (SLAs), which need to be guaranteed when the system is scaled. The problem is complex and has been tackled only in simple scenarios so far. This paper describes …
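
The abstract is truncated by the source page, but the core problem it states — choosing how many replicas to run as the request rate fluctuates, while keeping latency within the SLA — can be illustrated with a minimal sketch. This is a hypothetical single-stage example, not the paper's algorithm; `per_replica_qps`, `headroom`, and the profile numbers are assumptions.

```python
import math

def replicas_needed(arrival_qps: float,
                    per_replica_qps: float,
                    headroom: float = 0.8) -> int:
    """Smallest replica count that keeps per-replica load below
    `headroom` * profiled capacity, so queueing delay stays small
    enough to meet the latency SLA (hypothetical sketch)."""
    if arrival_qps <= 0:
        return 1  # keep one warm replica for availability
    usable_qps = per_replica_qps * headroom
    return max(1, math.ceil(arrival_qps / usable_qps))

# Scaling decisions as the workload fluctuates:
for qps in (50, 400, 1200):
    print(f"{qps} req/s -> {replicas_needed(qps, per_replica_qps=150)} replicas")
```

FA2 itself targets the harder case noted in the citation statements below, where such a decision must be made jointly across a whole DNN pipeline rather than for one model in isolation.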

Cited by 13 publications (5 citation statements) · References 46 publications

“…Moreover, high accuracy is crucial for these services [20,26]. Consequently, inference systems must deliver highly accurate predictions with fewer computing resources (cost-efficient) while meeting latency constraints under workload variations [18,19,29,37].…”
Section: Feature
confidence: 99%
“…Towards this direction, several works with different approaches have been conducted, aiming to address the problem of efficient resource orchestration and optimization for ML inference serving systems. Adaptive and pre-defined batching techniques [30]–[33] have been introduced to support ML inference, while auto-scaling approaches are also considered [32], [34]–[36]. Furthermore, serverless approaches have been explored for efficient ML inference [33], [37], while ML-based and predictive solutions for request load and resource utilization have been widely utilized [13], [14], [38]–[44], such as reinforcement learning-based solutions [43].…”
Section: Related Work
confidence: 99%
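The adaptive batching idea cited above can be sketched generically: accumulate requests into a batch only for as long as the oldest queued request can still be served within its latency budget. This is not any one cited system's implementation; the queue layout and the profiled `infer_s` figure are assumptions.

```python
import time
from queue import Queue, Empty

def form_batch(q: Queue, max_batch: int, budget_s: float, infer_s: float):
    """Adaptive batching sketch: grow the batch while the oldest
    request's remaining latency budget still covers one model run.

    Each queue item is (time.monotonic() at enqueue, request);
    `infer_s` is an assumed, profiled per-batch inference time."""
    first_ts, first_req = q.get()  # block until at least one request
    batch = [first_req]
    deadline = first_ts + budget_s - infer_s
    while len(batch) < max_batch and time.monotonic() < deadline:
        try:
            _, req = q.get(timeout=max(0.0, deadline - time.monotonic()))
            batch.append(req)
        except Empty:
            break
    return batch
```

Pre-defined batching, by contrast, would simply wait for a fixed batch size (or a fixed timeout) regardless of each request's remaining budget.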
“…Figure 9 indicates the QoS violations (top) and average CPU utilization (middle) for the image classification task. For the low (20) and medium (35) target QoS constraints, the model-less scheduler mostly uses the ONNX-Mobilenet benchmark, which never violates the constraint. With the high QoS constraint (50), the TF-Mobilenet benchmark is used most of the time, as it has a higher QPS capability with a lower CPU utilization.…”
Section: B. Model-less Inference Engine Scheduler Evaluation
confidence: 99%
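The selection logic this excerpt describes — choose, among profiled model variants, one whose measured throughput meets the target QoS at the lowest resource cost — can be sketched as follows. The profile numbers are invented placeholders chosen only so the sketch reproduces the selection pattern reported above; they are not measurements from the cited paper.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    qps_capacity: float  # profiled sustainable throughput
    cpu_util: float      # profiled CPU utilization at that throughput

def select_variant(variants, target_qps):
    """Model-less selection sketch: among variants that can sustain
    the target QPS, prefer the lowest CPU utilization."""
    feasible = [v for v in variants if v.qps_capacity >= target_qps]
    if not feasible:
        # No variant meets the target: fall back to the fastest one.
        return max(variants, key=lambda v: v.qps_capacity)
    return min(feasible, key=lambda v: v.cpu_util)

# Placeholder profiles (not the paper's measurements):
catalog = [Variant("ONNX-Mobilenet", qps_capacity=40, cpu_util=0.30),
           Variant("TF-Mobilenet",   qps_capacity=60, cpu_util=0.45)]
for target in (20, 35, 50):
    print(f"target {target} QPS -> {select_variant(catalog, target).name}")
```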
“…Clockwork [11] leverages the predictable performance of DNNs, considers the SLO guarantees on the server, and maps requests to the desired model, but does not utilize DNN adaptation. InferLine [44], Llama [45], and FA2 [46] optimize the serving of complex DNN pipelines. INFaaS [12] automates hardware and model-variant selection and deployment through managed services.…”
Section: Related Work
confidence: 99%