2023
DOI: 10.48550/arxiv.2302.11665
Preprint

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

Abstract: Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to re…
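The trade-off described in the abstract can be illustrated with a toy simulation (a hedged sketch, not AlpaServe's model or code): two models on two GPUs, comparing a dedicated placement against sharding each model across both GPUs with a fixed parallelism overhead. Under bursty, mostly non-overlapping traffic, the sharded placement borrows the idle GPU's capacity and keeps queues shorter; during overlapping bursts, the overhead makes it slightly worse. All parameters below are illustrative.

```python
import random

def bursty_arrivals(num_steps, burst_prob, burst_len, rate_in_burst, rng):
    """Generate a bursty arrival trace: mostly idle, with occasional bursts."""
    arrivals = [0.0] * num_steps
    t = 0
    while t < num_steps:
        if rng.random() < burst_prob:
            for k in range(t, min(t + burst_len, num_steps)):
                arrivals[k] = rate_in_burst
            t += burst_len
        else:
            t += 1
    return arrivals

def simulate(sharded, overhead=0.9, num_steps=20_000, seed=0):
    """Toy fluid model of two models served on two GPUs.

    sharded=False: model i is pinned to GPU i (capacity 1 unit/step each).
    sharded=True:  both models are partitioned over both GPUs; the combined
                   capacity (2 units/step, scaled by `overhead` to model the
                   cost of model parallelism) is shared by whichever models
                   currently have queued work.
    Returns the average total backlog, a proxy for queueing delay.
    """
    rng = random.Random(seed)
    traces = [bursty_arrivals(num_steps, 0.005, 50, 1.8, rng) for _ in range(2)]
    queues = [0.0, 0.0]
    backlog = 0.0
    for t in range(num_steps):
        for i in range(2):
            queues[i] += traces[i][t]
        if sharded:
            busy = [i for i in range(2) if queues[i] > 0]
            cap = 2.0 * overhead / max(len(busy), 1)
            for i in busy:
                queues[i] = max(0.0, queues[i] - cap)
        else:
            for i in range(2):
                queues[i] = max(0.0, queues[i] - 1.0)
        backlog += sum(queues)
    return backlog / num_steps

if __name__ == "__main__":
    print("dedicated placement, avg backlog:", round(simulate(sharded=False), 2))
    print("sharded placement,   avg backlog:", round(simulate(sharded=True), 2))
```

With these illustrative parameters, the sharded placement yields a much smaller average backlog because each model's burst can use both GPUs while the other model is idle, which is the statistical-multiplexing effect the paper exploits.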

Cited by 3 publications (4 citation statements)
References 26 publications (51 reference statements)

“…Many LLMs have parameter sizes exceeding the capacity of a single GPU [5,9]. Therefore, it is necessary to partition them across distributed GPUs and execute them in a model parallel fashion [28,63]. This calls for a memory manager capable of handling distributed memory.…”
Section: Distributed Execution
confidence: 99%
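To make concrete why distributed execution needs a memory manager that handles distributed memory, here is a minimal column-wise tensor-parallel sketch in NumPy (an illustration only, not the API of any system cited above): each "device" holds just a shard of one layer's weights and computes a partial output, and the concatenation stands in for the cross-device all-gather.

```python
import numpy as np

def shard_weights(weight, num_devices):
    """Split the [in_dim, out_dim] weight column-wise, one shard per device."""
    return np.split(weight, num_devices, axis=1)

def parallel_linear(x, weight_shards):
    """Each 'device' multiplies against its own shard; the concatenation
    stands in for the all-gather a real multi-GPU setup would perform."""
    partials = [x @ shard for shard in weight_shards]   # one matmul per device
    return np.concatenate(partials, axis=-1)            # communication step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 1024))       # batch of activations
    w = rng.standard_normal((1024, 4096))     # full layer, assumed too big for one device
    shards = shard_weights(w, num_devices=4)  # each device stores 1/4 of the weights
    np.testing.assert_allclose(parallel_linear(x, shards), x @ w, rtol=1e-5)
    print("per-device weight bytes:", shards[0].nbytes, "vs full:", w.nbytes)
```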
“…REEF [21] and Shepherd [61] propose preemption for serving. AlpaServe [28] utilizes model parallelism for statistical multiplexing. However, these general systems fail to take into account the autoregressive property and token state of LLM inference, resulting in missed opportunities for optimization.…”
Section: Related Work
confidence: 99%
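The "autoregressive property and token state" mentioned above refers to the per-request key/value cache that grows by one token at every decoding step. The sketch below (hypothetical sizes and names, not any cited system's implementation) tracks only a token count to show why an LLM request is not a fixed-size unit of work the way a single-shot inference is.

```python
from dataclasses import dataclass

@dataclass
class RequestState:
    prompt_len: int
    generated: int = 0
    kv_tokens: int = 0   # number of tokens whose keys/values are cached

    def decode_step(self):
        """One autoregressive step: emit a token and grow the KV cache."""
        self.generated += 1
        self.kv_tokens = self.prompt_len + self.generated

def kv_bytes(state, num_layers=32, hidden=4096, bytes_per_elem=2):
    """Approximate KV-cache footprint: 2 (K and V) * layers * hidden * tokens.
    The model dimensions are illustrative, not taken from any cited paper."""
    return 2 * num_layers * hidden * bytes_per_elem * state.kv_tokens

if __name__ == "__main__":
    req = RequestState(prompt_len=512)
    for _ in range(256):          # generate 256 tokens
        req.decode_step()
    print(f"KV cache after decoding: {kv_bytes(req) / 2**20:.1f} MiB")
```

A scheduler that ignores this growing state cannot know how much device memory a running request will need, which is the missed optimization opportunity the citing paper points to.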
“…Consequently, cloud computing is poised to assume the role of the principal undergirding infrastructure for customized large-scale model inference in the forthcoming era. Nonetheless, this development has engendered pervasive apprehensions regarding the adaptability of extant cloud computing infrastructure, which is primarily tailored to accommodate lightweight applications, such as microservices, to the shifting paradigms encapsulated within this burgeoning landscape [5][6][7][8][9].…”
Section: Introduction and Observation
confidence: 99%
“…In order to meet stringent SLOs, providers often need to over-provision resources [164]. To address this, [82] proposed a serving system, AlpaServe, that automatically selects a strategy for placing and parallelizing large models on a distributed cluster. Results on two production traces from Microsoft Azure show that the proposed serving system can improve SLO attainment.…”
Section: Large Language Models Serving Systems
confidence: 99%
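For a rough sense of the placement search such a serving system performs, here is a sketch under simplified assumptions (not AlpaServe's actual algorithm [82]): enumerate model-parallel group sizes on a fixed cluster, estimate latency and capacity with a crude cost model, and keep the placement with the least parallelism that still meets the SLO under the offered load. All constants are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    group_size: int   # GPUs per model-parallel group
    num_groups: int   # replicas = cluster_gpus // group_size

def candidate_placements(cluster_gpus):
    return [Placement(g, cluster_gpus // g)
            for g in (1, 2, 4, 8) if g <= cluster_gpus and cluster_gpus % g == 0]

def estimate_latency_ms(base_ms, group_size, overhead=0.10):
    """Larger groups cut per-GPU compute but add communication overhead."""
    return base_ms / group_size * (1.0 + overhead * (group_size - 1))

def meets_slo(placement, request_rate, base_ms, slo_ms):
    """Crude check: latency fits the SLO and the groups jointly keep up
    with the offered load, with 20% headroom against queueing."""
    latency = estimate_latency_ms(base_ms, placement.group_size)
    capacity = placement.num_groups * (1000.0 / latency)   # requests/sec
    return latency <= slo_ms and request_rate <= 0.8 * capacity

def choose_placement(cluster_gpus=8, request_rate=10.0, base_ms=400.0, slo_ms=250.0):
    feasible = [p for p in candidate_placements(cluster_gpus)
                if meets_slo(p, request_rate, base_ms, slo_ms)]
    # Prefer the smallest group that meets the SLO (least parallelism overhead).
    return min(feasible, key=lambda p: p.group_size) if feasible else None

if __name__ == "__main__":
    print("chosen placement:", choose_placement())
```

With these illustrative numbers, a single-GPU placement misses the latency SLO while a two-way model-parallel placement meets it despite the added overhead, which mirrors the trade-off the citing survey attributes to AlpaServe.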