SC22: International Conference for High Performance Computing, Networking, Storage and Analysis 2022
DOI: 10.1109/sc41404.2022.00051
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Cited by 35 publications (23 citation statements)
References 8 publications
“…We show that the autoparallelization ability allows AlpaServe not only to generalize to arbitrary model architectures but also to reduce parallelism overheads, hence improving serving performance (see §3.3 for more discussion). To see this, note that the typical manual model-parallelization strategy offered in de facto systems [1,27,28] is to assign an equal number of (transformer) layers to each pipeline stage. Such strategies often fail to create balanced workloads across distributed GPUs because contemporary large models have heterogeneous layers, such as embedding operations.…”
Section: Ablation Study
Citation type: mentioning (confidence: 99%)
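The imbalance this excerpt describes is easy to see with a toy partitioner. The sketch below is purely illustrative, with hypothetical per-layer costs and not code from AlpaServe, DeepSpeed-Inference, or any cited system: it contrasts the naive equal-layer split with a simple cost-aware greedy split. Pipeline throughput is paced by the most expensive stage, so the equal split's heavy first stage is the bottleneck.

```python
# Toy pipeline-stage partitioning (hypothetical layer costs, illustrative only).

def equal_layer_split(costs, num_stages):
    """Assign the same number of layers to every stage; leftovers go to the last stage."""
    per_stage = len(costs) // num_stages
    stages = [list(costs[i * per_stage:(i + 1) * per_stage]) for i in range(num_stages)]
    stages[-1].extend(costs[num_stages * per_stage:])
    return stages

def greedy_balanced_split(costs, num_stages):
    """Contiguous greedy split that cuts a stage once its cost reaches the average."""
    target = sum(costs) / num_stages
    stages, current = [], []
    for i, c in enumerate(costs):
        current.append(c)
        remaining_layers = len(costs) - i - 1
        remaining_stages = num_stages - len(stages) - 1
        if sum(current) >= target and remaining_layers >= remaining_stages > 0:
            stages.append(current)
            current = []
    stages.append(current)
    return stages

# Hypothetical costs: one heavy embedding layer plus 8 uniform transformer blocks.
layer_costs = [30] + [10] * 8
for name, split in [("equal layers ", equal_layer_split(layer_costs, 3)),
                    ("cost-balanced", greedy_balanced_split(layer_costs, 3))]:
    print(name, "-> per-stage cost:", [sum(stage) for stage in split])
# equal layers  -> per-stage cost: [50, 30, 30]  (pipeline paced by the 50-cost stage)
# cost-balanced -> per-stage cost: [40, 40, 30]
```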
“…AlpaServe is complementary to another large body of work on optimizations for inference over large models. These include techniques like quantization [11], distillation [36], offloading [1], better operator parallelism [32], and CUDA kernel optimization [9,22]. Some of these optimizations are intended to stem the tide of increasing model sizes; however, all of these gains are partial: the challenge of serving large models has continued to escalate rapidly despite these efforts.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
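Of the techniques this excerpt lists, quantization is the simplest to illustrate in isolation. The following is a minimal sketch of symmetric per-tensor INT8 weight quantization in NumPy; it is a generic illustration, not the specific method of any of the cited works.

```python
# Minimal symmetric per-tensor INT8 weight quantization (illustrative only).
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 values plus a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print("memory: %.0f%% of FP32" % (100 * q.nbytes / w.nbytes))          # ~25%
print("max abs error:", float(np.abs(w - dequantize(q, scale)).max()))  # small reconstruction error
```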
“…For example, GPT-175B requires 325GB of GPU memory simply to load its model weights. Fitting this model onto GPUs would require at least five A100 (80GB) GPUs and complex parallelism strategies (Pope et al., 2022; Aminabadi et al., 2022). Thus, lowering LLM inference resource requirements has recently attracted intense interest.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
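The numbers in this excerpt follow from a simple back-of-the-envelope calculation, assuming FP16 weights (2 bytes per parameter) and counting only the weights, not activations or the KV cache:

```python
# Back-of-the-envelope weight-memory estimate (FP16, 2 bytes/parameter; activations
# and KV cache ignored, so this is only a lower bound on what inference needs).
import math

params = 175e9                                 # GPT-175B parameter count
bytes_per_param = 2                            # FP16
weights_gib = params * bytes_per_param / 2**30
gpus_needed = math.ceil(weights_gib / 80)      # A100 80GB per device

print(f"weights ~= {weights_gib:.0f} GiB, so at least {gpus_needed} x A100-80GB")
# weights ~= 326 GiB, so at least 5 x A100-80GB
```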
“…Prior efforts to lower the resource requirements of LLM inference fall into three directions: (1) model compression to decrease the total memory footprint (Yao et al., 2022; Frantar et al., 2022; Xiao et al., 2022); (2) collaborative inference to amortize inference cost via decentralization (Borzunov et al., 2022); and (3) offloading to utilize memory from CPU and disk (Aminabadi et al., 2022; HuggingFace, 2022). These techniques have significantly lowered the resource requirements for using LLMs, but there are distinct limitations.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
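Direction (3), offloading, can be sketched in a few lines of PyTorch: keep every layer's weights in CPU memory and move each layer to the GPU only for the moment it runs. The toy module below only conveys the idea and is not the cited systems' implementation; real offloading engines overlap transfers with compute, spill to disk, and also manage the KV cache.

```python
# Toy weight-offloading sketch: layers live on the CPU and are streamed to the GPU
# one at a time during the forward pass (illustrative only).
import torch
import torch.nn as nn

class OffloadedStack(nn.Module):
    def __init__(self, num_layers=4, hidden=1024):
        super().__init__()
        # Layers are built (and kept) in CPU memory.
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))

    @torch.no_grad()
    def forward(self, x):
        device = x.device
        for layer in self.layers:
            layer.to(device)         # bring this layer's weights onto the accelerator
            x = torch.relu(layer(x))
            layer.to("cpu")          # evict it before touching the next layer
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = OffloadedStack()
out = model(torch.randn(2, 1024, device=device))
print(out.shape)  # torch.Size([2, 1024])
```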