2012
DOI: 10.1145/2382553.2382556
AutoScale

Abstract: Energy costs for data centers continue to rise, already exceeding $15 billion yearly. Sadly, much of this power is wasted. Servers are only busy 10--30% of the time on average, but they are often left on while idle, consuming 60% or more of peak power in the idle state. We introduce a dynamic capacity management policy, AutoScale, that greatly reduces the number of servers needed in data centers driven by unpredictable, time-varying load, while meeting respo…

Cited by 223 publications (15 citation statements)
References 28 publications
“…We use the techniques proposed in [13] to do both low-frequency planning and high-frequency tuning for the coarse-grained pipelines as a baseline for comparison. In this baseline, we profile the entire pipeline as a single black box to identify the single maximum batch size capable of meeting the SLO, in contrast to InferLine's per-model profiling.…”
Section: Methods (mentioning)
confidence: 99%
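The black-box baseline described in this statement can be sketched as a search for the largest batch size whose end-to-end profiled latency still meets the SLO. The sketch below assumes latency is monotonically non-decreasing in batch size; `profile_latency` is a hypothetical stand-in for running the real pipeline, not an API from the cited works.

```python
# Sketch of the black-box baseline: profile the whole pipeline at
# different batch sizes and keep the largest one still within the SLO.

def profile_latency(batch_size: int) -> float:
    """Hypothetical profiler: pretend latency grows linearly with batch size."""
    return 10.0 + 2.5 * batch_size  # milliseconds

def max_batch_under_slo(slo_ms: float, max_batch: int = 1024) -> int:
    """Binary-search the largest batch size whose profiled latency <= slo_ms,
    assuming latency is monotonically non-decreasing in batch size."""
    lo, hi, best = 1, max_batch, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if profile_latency(mid) <= slo_ms:
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best

print(max_batch_under_slo(100.0))  # latency(36) = 100.0 ms -> 36
```

A per-model variant (as InferLine does) would run this same search against each stage's profiler separately rather than the whole pipeline at once.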
“…Scaling Down (Algorithm 4): InferLine takes a conservative approach to scaling down the pipeline to prevent unnecessary configuration oscillation, which can cause SLO misses. Drawing on the work in [13], the Tuner waits for a period of time after any configuration change to allow the system to stabilize before considering any down-scaling actions. InferLine uses a delay of 15 seconds (3x the 5-second activation time of spinning up new replicas in the underlying prediction serving frameworks), but the precise value is unimportant as long as it provides enough time for the pipeline to stabilize after a scaling action.…”
Section: High-Frequency Tuning (mentioning)
confidence: 99%
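The stabilization rule in this statement can be sketched as a simple time gate: after any configuration change, scale-down decisions are suppressed until a delay (here 3x the replica activation time) has elapsed. Class and method names below are illustrative, not InferLine's actual API.

```python
# Sketch of the conservative scale-down gate: wait a stabilization
# period after any configuration change before allowing scale-down.

ACTIVATION_TIME_S = 5.0
STABILIZATION_DELAY_S = 3 * ACTIVATION_TIME_S  # 15 seconds

class ScaleDownGate:
    def __init__(self, delay: float = STABILIZATION_DELAY_S):
        self.delay = delay
        self.last_change = float("-inf")  # no change recorded yet

    def record_config_change(self, now: float) -> None:
        """Call whenever the pipeline configuration changes."""
        self.last_change = now

    def may_scale_down(self, now: float) -> bool:
        """True once the stabilization delay has fully elapsed."""
        return now - self.last_change >= self.delay

gate = ScaleDownGate()
gate.record_config_change(now=100.0)
print(gate.may_scale_down(now=110.0))  # False: only 10 s elapsed
print(gate.may_scale_down(now=116.0))  # True: 16 s >= 15 s
```

Scale-up actions are deliberately not gated here, matching the asymmetric caution described in the statement: reacting quickly to load increases, slowly to decreases.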
“…Instead, if we select the least loaded host, over time we will have balanced hosts, making it difficult to identify less loaded hosts to drain connections from in case of a scale-down. Prior works have shown this load-unbalancing technique to facilitate server scaling in web clusters [6,16], while we explore this in the context of stateless MME architectures with state distributed across multiple MME hosts (see §7.2).…”
Section: Selection of Final Host from Viable Hosts (mentioning)
confidence: 99%
“…On the other hand, the autoscaler is in charge of adapting the number of available resources to the incoming workload [37]. The choice of the autoscaler is critical for many reasons, in particular for pricing issues such as minimizing power consumption in a data center [38,37]. In the following, we assume that an autoscaler is in place and that the number of incoming requests does not change the number of available resources; therefore we focus only on the load-balancing algorithm.…”
Section: Related Work (mentioning)
confidence: 99%
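The autoscaler role described in this statement can be sketched as a sizing rule that maps the incoming request rate to a server count, with some headroom. The constants and the proportional rule below are assumptions for illustration, not taken from [37] or [38].

```python
import math

# Minimal sketch of an autoscaler's sizing decision: choose the number
# of servers from the incoming request rate, plus a safety margin.

PER_SERVER_RPS = 100.0   # requests/s one server handles within the SLO (assumed)
HEADROOM = 1.2           # 20% spare capacity (assumed)

def servers_needed(request_rate_rps: float, n_min: int = 1) -> int:
    """Smallest server count covering the rate with headroom, at least n_min."""
    return max(n_min, math.ceil(request_rate_rps * HEADROOM / PER_SERVER_RPS))

print(servers_needed(450.0))  # ceil(540 / 100) = 6
print(servers_needed(0.0))    # minimum of 1 server
```

A production autoscaler would also smooth the rate estimate and apply a scale-down delay (as in the InferLine statement above) to avoid oscillation.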