2023
DOI: 10.48550/arxiv.2302.11665
Preprint

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

Abstract: Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to re…
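The trade-off described in the abstract can be illustrated with a toy simulation (a hedged sketch, not AlpaServe's model or code): two models on two GPUs, comparing a dedicated placement against sharding each model across both GPUs with a fixed parallelism overhead. Under bursty, mostly non-overlapping traffic, the sharded placement borrows the idle GPU's capacity and keeps queues shorter; during overlapping bursts, the overhead makes it slightly worse. All parameters below are illustrative.

```python
import random

def bursty_arrivals(num_steps, burst_prob, burst_len, rate_in_burst, rng):
    """Generate a bursty arrival trace: mostly idle, with occasional bursts."""
    arrivals = [0.0] * num_steps
    t = 0
    while t < num_steps:
        if rng.random() < burst_prob:
            for k in range(t, min(t + burst_len, num_steps)):
                arrivals[k] = rate_in_burst
            t += burst_len
        else:
            t += 1
    return arrivals

def simulate(sharded, overhead=0.9, num_steps=20_000, seed=0):
    """Toy fluid model of two models served on two GPUs.

    sharded=False: model i is pinned to GPU i (capacity 1 unit/step each).
    sharded=True:  both models are partitioned over both GPUs; the combined
                   capacity (2 units/step, scaled by `overhead` to model the
                   cost of model parallelism) is shared by whichever models
                   currently have queued work.
    Returns the average total backlog, a proxy for queueing delay.
    """
    rng = random.Random(seed)
    traces = [bursty_arrivals(num_steps, 0.005, 50, 1.8, rng) for _ in range(2)]
    queues = [0.0, 0.0]
    backlog = 0.0
    for t in range(num_steps):
        for i in range(2):
            queues[i] += traces[i][t]
        if sharded:
            busy = [i for i in range(2) if queues[i] > 0]
            cap = 2.0 * overhead / max(len(busy), 1)
            for i in busy:
                queues[i] = max(0.0, queues[i] - cap)
        else:
            for i in range(2):
                queues[i] = max(0.0, queues[i] - 1.0)
        backlog += sum(queues)
    return backlog / num_steps

if __name__ == "__main__":
    print("dedicated placement, avg backlog:", round(simulate(sharded=False), 2))
    print("sharded placement,   avg backlog:", round(simulate(sharded=True), 2))
```

With these illustrative parameters, the sharded placement yields a much smaller average backlog because each model's burst can use both GPUs while the other model is idle, which is the statistical-multiplexing effect the paper exploits.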

Cited by 3 publications (4 citation statements)
References 26 publications (51 reference statements)

“…Many LLMs have parameter sizes exceeding the capacity of a single GPU [5,9]. Therefore, it is necessary to partition them across distributed GPUs and execute them in a model parallel fashion [28,63]. This calls for a memory manager capable of handling distributed memory.…”
Section: Distributed Execution
confidence: 99%
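To make concrete why distributed execution needs a memory manager that handles distributed memory, here is a minimal column-wise tensor-parallel sketch in NumPy (an illustration only, not the API of any system cited above): each "device" holds just a shard of one layer's weights and computes a partial output, and the concatenation stands in for the cross-device all-gather.

```python
import numpy as np

def shard_weights(weight, num_devices):
    """Split the [in_dim, out_dim] weight column-wise, one shard per device."""
    return np.split(weight, num_devices, axis=1)

def parallel_linear(x, weight_shards):
    """Each 'device' multiplies against its own shard; the concatenation
    stands in for the all-gather a real multi-GPU setup would perform."""
    partials = [x @ shard for shard in weight_shards]   # one matmul per device
    return np.concatenate(partials, axis=-1)            # communication step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 1024))       # batch of activations
    w = rng.standard_normal((1024, 4096))     # full layer, assumed too big for one device
    shards = shard_weights(w, num_devices=4)  # each device stores 1/4 of the weights
    np.testing.assert_allclose(parallel_linear(x, shards), x @ w, rtol=1e-5)
    print("per-device weight bytes:", shards[0].nbytes, "vs full:", w.nbytes)
```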
“…REEF [21] and Shepherd [61] propose preemption for serving. AlpaServe [28] utilizes model parallelism for statistical multiplexing. However, these general systems fail to take into account the autoregressive property and token state of LLM inference, resulting in missed opportunities for optimization.…”
Section: Related Work
confidence: 99%
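The "autoregressive property and token state" mentioned above refers to the per-request key/value cache that grows by one token at every decoding step. The sketch below (hypothetical sizes and names, not any cited system's implementation) tracks only a token count to show why an LLM request is not a fixed-size unit of work the way a single-shot inference is.

```python
from dataclasses import dataclass

@dataclass
class RequestState:
    prompt_len: int
    generated: int = 0
    kv_tokens: int = 0   # number of tokens whose keys/values are cached

    def decode_step(self):
        """One autoregressive step: emit a token and grow the KV cache."""
        self.generated += 1
        self.kv_tokens = self.prompt_len + self.generated

def kv_bytes(state, num_layers=32, hidden=4096, bytes_per_elem=2):
    """Approximate KV-cache footprint: 2 (K and V) * layers * hidden * tokens.
    The model dimensions are illustrative, not taken from any cited paper."""
    return 2 * num_layers * hidden * bytes_per_elem * state.kv_tokens

if __name__ == "__main__":
    req = RequestState(prompt_len=512)
    for _ in range(256):          # generate 256 tokens
        req.decode_step()
    print(f"KV cache after decoding: {kv_bytes(req) / 2**20:.1f} MiB")
```

A scheduler that ignores this growing state cannot know how much device memory a running request will need, which is the missed optimization opportunity the citing paper points to.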
“…Consequently, cloud computing is poised to assume the role of the principal undergirding infrastructure for customized large-scale model inference in the forthcoming era. Nonetheless, this development has engendered pervasive apprehensions regarding the adaptability of extant cloud computing infrastructure, which is primarily tailored to accommodate lightweight applications, such as microservices, to the shifting paradigms encapsulated within this burgeoning landscape [5][6][7][8][9].…”
Section: Introduction and Observation
confidence: 99%
“…In order to meet stringent SLOs, providers often need to over-provision resources [164]. To address this, [82] proposed a serving system, AlpaServe, that automatically selects a strategy for placing and parallelizing large models on a distributed cluster. Results on two production traces from Microsoft Azure show that the proposed serving system can improve SLO attainment.…”
Section: Large Language Models Serving Systems
confidence: 99%
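For a rough sense of the placement search such a serving system performs, here is a sketch under simplified assumptions (not AlpaServe's actual algorithm [82]): enumerate model-parallel group sizes on a fixed cluster, estimate latency and capacity with a crude cost model, and keep the placement with the least parallelism that still meets the SLO under the offered load. All constants are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    group_size: int   # GPUs per model-parallel group
    num_groups: int   # replicas = cluster_gpus // group_size

def candidate_placements(cluster_gpus):
    return [Placement(g, cluster_gpus // g)
            for g in (1, 2, 4, 8) if g <= cluster_gpus and cluster_gpus % g == 0]

def estimate_latency_ms(base_ms, group_size, overhead=0.10):
    """Larger groups cut per-GPU compute but add communication overhead."""
    return base_ms / group_size * (1.0 + overhead * (group_size - 1))

def meets_slo(placement, request_rate, base_ms, slo_ms):
    """Crude check: latency fits the SLO and the groups jointly keep up
    with the offered load, with 20% headroom against queueing."""
    latency = estimate_latency_ms(base_ms, placement.group_size)
    capacity = placement.num_groups * (1000.0 / latency)   # requests/sec
    return latency <= slo_ms and request_rate <= 0.8 * capacity

def choose_placement(cluster_gpus=8, request_rate=10.0, base_ms=400.0, slo_ms=250.0):
    feasible = [p for p in candidate_placements(cluster_gpus)
                if meets_slo(p, request_rate, base_ms, slo_ms)]
    # Prefer the smallest group that meets the SLO (least parallelism overhead).
    return min(feasible, key=lambda p: p.group_size) if feasible else None

if __name__ == "__main__":
    print("chosen placement:", choose_placement())
```

With these illustrative numbers, a single-GPU placement misses the latency SLO while a two-way model-parallel placement meets it despite the added overhead, which mirrors the trade-off the citing survey attributes to AlpaServe.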