SC22: International Conference for High Performance Computing, Networking, Storage and Analysis 2022
DOI: 10.1109/sc41404.2022.00051
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Cited by 35 publications (23 citation statements)
References 8 publications
“…We show that the autoparallelization ability allows AlpaServe not only to generalize to arbitrary model architectures but also to reduce parallelism overheads, hence improving serving performance (see §3.3 for more discussion). To see this, note that the typical manual model-parallelization strategy offered in de facto systems [1,27,28] is to assign an equal number of (transformer) layers to each pipeline stage. Such strategies often fail to create balanced workloads across distributed GPUs because contemporary large models have heterogeneous layers, such as embedding operations.…”
Section: Ablation Study
Citation type: mentioning (confidence: 99%)
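The imbalance this excerpt describes is easy to see with a toy partitioner. The sketch below is purely illustrative, with hypothetical per-layer costs and not code from AlpaServe, DeepSpeed-Inference, or any cited system: it contrasts the naive equal-layer split with a simple cost-aware greedy split. Pipeline throughput is paced by the most expensive stage, so the equal split's heavy first stage is the bottleneck.

```python
# Toy pipeline-stage partitioning (hypothetical layer costs, illustrative only).

def equal_layer_split(costs, num_stages):
    """Assign the same number of layers to every stage; leftovers go to the last stage."""
    per_stage = len(costs) // num_stages
    stages = [list(costs[i * per_stage:(i + 1) * per_stage]) for i in range(num_stages)]
    stages[-1].extend(costs[num_stages * per_stage:])
    return stages

def greedy_balanced_split(costs, num_stages):
    """Contiguous greedy split that cuts a stage once its cost reaches the average."""
    target = sum(costs) / num_stages
    stages, current = [], []
    for i, c in enumerate(costs):
        current.append(c)
        remaining_layers = len(costs) - i - 1
        remaining_stages = num_stages - len(stages) - 1
        if sum(current) >= target and remaining_layers >= remaining_stages > 0:
            stages.append(current)
            current = []
    stages.append(current)
    return stages

# Hypothetical costs: one heavy embedding layer plus 8 uniform transformer blocks.
layer_costs = [30] + [10] * 8
for name, split in [("equal layers ", equal_layer_split(layer_costs, 3)),
                    ("cost-balanced", greedy_balanced_split(layer_costs, 3))]:
    print(name, "-> per-stage cost:", [sum(stage) for stage in split])
# equal layers  -> per-stage cost: [50, 30, 30]  (pipeline paced by the 50-cost stage)
# cost-balanced -> per-stage cost: [40, 40, 30]
```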
“…AlpaServe is complementary to another large body of work on optimizations for inference over large models. These include techniques like quantization [11], distillation [36], offloading [1], better operator parallelism [32], and CUDA kernel optimization [9,22]. Some of these optimizations are intended to stem the tide of increasing model sizes; however, all of these gains are partial: the challenge of serving large models has continued to escalate rapidly despite these efforts.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
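Of the techniques this excerpt lists, quantization is the simplest to illustrate in isolation. The following is a minimal sketch of symmetric per-tensor INT8 weight quantization in NumPy; it is a generic illustration, not the specific method of any of the cited works.

```python
# Minimal symmetric per-tensor INT8 weight quantization (illustrative only).
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 values plus a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print("memory: %.0f%% of FP32" % (100 * q.nbytes / w.nbytes))          # ~25%
print("max abs error:", float(np.abs(w - dequantize(q, scale)).max()))  # small reconstruction error
```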
“…For example, GPT-175B requires 325GB of GPU memory simply to load its model weights. Fitting this model onto GPUs would require at least five A100 (80GB) GPUs and complex parallelism strategies (Pope et al., 2022; Aminabadi et al., 2022). Thus, lowering LLM inference resource requirements has recently attracted intense interest.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
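The numbers in this excerpt follow from a simple back-of-the-envelope calculation, assuming FP16 weights (2 bytes per parameter) and counting only the weights, not activations or the KV cache:

```python
# Back-of-the-envelope weight-memory estimate (FP16, 2 bytes/parameter; activations
# and KV cache ignored, so this is only a lower bound on what inference needs).
import math

params = 175e9                                 # GPT-175B parameter count
bytes_per_param = 2                            # FP16
weights_gib = params * bytes_per_param / 2**30
gpus_needed = math.ceil(weights_gib / 80)      # A100 80GB per device

print(f"weights ~= {weights_gib:.0f} GiB, so at least {gpus_needed} x A100-80GB")
# weights ~= 326 GiB, so at least 5 x A100-80GB
```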
“…Prior efforts to lower the resource requirements of LLM inference fall into three directions: (1) model compression to decrease the total memory footprint (Yao et al., 2022; Frantar et al., 2022; Xiao et al., 2022); (2) collaborative inference to amortize inference cost via decentralization (Borzunov et al., 2022); and (3) offloading to utilize memory from CPU and disk (Aminabadi et al., 2022; HuggingFace, 2022). These techniques have significantly lowered the resource requirements for using LLMs, but there are distinct limitations.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
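Direction (3), offloading, can be sketched in a few lines of PyTorch: keep every layer's weights in CPU memory and move each layer to the GPU only for the moment it runs. The toy module below only conveys the idea and is not the cited systems' implementation; real offloading engines overlap transfers with compute, spill to disk, and also manage the KV cache.

```python
# Toy weight-offloading sketch: layers live on the CPU and are streamed to the GPU
# one at a time during the forward pass (illustrative only).
import torch
import torch.nn as nn

class OffloadedStack(nn.Module):
    def __init__(self, num_layers=4, hidden=1024):
        super().__init__()
        # Layers are built (and kept) in CPU memory.
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))

    @torch.no_grad()
    def forward(self, x):
        device = x.device
        for layer in self.layers:
            layer.to(device)         # bring this layer's weights onto the accelerator
            x = torch.relu(layer(x))
            layer.to("cpu")          # evict it before touching the next layer
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = OffloadedStack()
out = model(torch.randn(2, 1024, device=device))
print(out.shape)  # torch.Size([2, 1024])
```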