2022
DOI: 10.48550/arxiv.2201.05596
Preprint

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Abstract: As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Their training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger…

Cited by 8 publications (12 citation statements)
References 29 publications
“…Although all computation and communication are assigned to different CUDA streams, PyTorch will block the computation stream until the communication completes. On the other hand, the computation in the vanilla Transformer model is straightforward, so there is no opportunity to overlap the communication inside the transformer layer, as in Megatron-LM [10] and DeepSpeed-MoE [15].…”
Section: B. Parallel Evoformer (mentioning)
Confidence: 99%
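The synchronization the authors describe can be made concrete with a minimal PyTorch sketch (illustrative only, not code from the cited systems; the expert module, process group, and streams are assumed to be set up elsewhere): even if the MoE all-to-all is issued on a dedicated communication stream, the compute stream must wait for it before the expert FFN can run, so no overlap is gained when the computation depends directly on the communicated tokens.

```python
import torch
import torch.distributed as dist

def dispatch_and_compute(tokens, expert, comm_stream):
    # Issue the MoE all-to-all token dispatch on a dedicated communication stream.
    recv = torch.empty_like(tokens)
    with torch.cuda.stream(comm_stream):
        dist.all_to_all_single(recv, tokens)
    # The compute stream still has to wait for the communication to finish,
    # because the expert FFN consumes the received tokens immediately;
    # this synchronization point is what prevents any real overlap.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return expert(recv)
```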
“…There are a few works focusing on inheriting knowledge from a dense model to initialize an MoE model, which is the opposite of our work. For instance, Zhang et al. (2022) duplicated a dense model multiple times to initialize MoE models, and Zhang et al. (2021) proposed MoEfication.…”
Section: Knowledge Integration (mentioning)
Confidence: 99%
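As a rough illustration of that duplication-based initialization (a minimal sketch; `moe_from_dense` is a hypothetical helper, not an API of the cited works, and the expert is assumed to be a standard FFN module), each expert can simply start as a deep copy of the trained dense FFN:

```python
import copy
import torch.nn as nn

def moe_from_dense(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    # Initialize every expert as an identical copy of the trained dense FFN,
    # mirroring the "duplicate the dense model" initialization described above.
    return nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
```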
“…For GPU clusters, the all-to-all operation is too slow to scale the MoE model up. Besides, the gating function includes numerous operations to create token masks, select the top-k experts, and perform a cumulative sum to find the token IDs going to each expert, followed by a sparse matrix multiply (Rajbhandari et al., 2022). All these operations are wasteful due to the sparse tensor representation.…”
Section: Introduction (mentioning)
Confidence: 99%
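Those gating steps can be sketched for the top-1 case as follows (a generic, illustrative PyTorch implementation, not the DeepSpeed-MoE kernels; `capacity` denotes an assumed per-expert token budget): a one-hot token mask is built, the top expert is selected, and a cumulative sum assigns each token a slot in its expert's buffer.

```python
import torch
import torch.nn.functional as F

def top1_gate(logits: torch.Tensor, capacity: int):
    # logits: (num_tokens, num_experts) router scores.
    gates = torch.softmax(logits, dim=-1)
    weight, expert_idx = gates.max(dim=-1)           # top-1 expert per token
    mask = F.one_hot(expert_idx, gates.size(-1))     # token mask, (tokens, experts)
    # Cumulative sum over tokens gives each token its position in its expert's buffer.
    position_in_expert = (torch.cumsum(mask, dim=0) - 1) * mask
    # Drop tokens that exceed the expert's capacity.
    keep = (position_in_expert < capacity) & mask.bool()
    return weight * keep.any(dim=-1), expert_idx, position_in_expert
```

The dense one-hot masks and the cumulative sums over them are exactly the bookkeeping that the quoted passage characterizes as wasteful for what is logically a sparse token-to-expert assignment.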
“…The sparsely-activated training paradigm necessitates new system support. However, existing MoE training systems, including DeepSpeed-MoE [17], Tutel [12], and FastMoE [6], still face limitations in both usability and efficiency. First, they support only part of the mainstream MoE models and gate networks (e.g.…”
Section: Introduction (mentioning)
Confidence: 99%