2022
DOI: 10.48550/arxiv.2201.05596
Preprint

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Abstract: As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Their training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger…

Cited by 8 publications (12 citation statements)
References 29 publications
“…Although all computation and communication are assigned to different CUDA streams, PyTorch will block the computation stream until the communication completes. On the other hand, the computation in the vanilla Transformer model is straightforward, so there is no opportunity to overlap the communication inside the transformer layer, as in Megatron-LM [10] and DeepSpeed-MoE [15].…”
Section: B. Parallel Evoformer (mentioning)
Confidence: 99%
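The synchronization the authors describe can be made concrete with a minimal PyTorch sketch (illustrative only, not code from the cited systems; the expert module, process group, and streams are assumed to be set up elsewhere): even if the MoE all-to-all is issued on a dedicated communication stream, the compute stream must wait for it before the expert FFN can run, so no overlap is gained when the computation depends directly on the communicated tokens.

```python
import torch
import torch.distributed as dist

def dispatch_and_compute(tokens, expert, comm_stream):
    # Issue the MoE all-to-all token dispatch on a dedicated communication stream.
    recv = torch.empty_like(tokens)
    with torch.cuda.stream(comm_stream):
        dist.all_to_all_single(recv, tokens)
    # The compute stream still has to wait for the communication to finish,
    # because the expert FFN consumes the received tokens immediately;
    # this synchronization point is what prevents any real overlap.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return expert(recv)
```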
“…There are a few works focusing on inheriting knowledge from a dense model to initialize an MoE model, which is the opposite of our work. For instance, Zhang et al. (2022) duplicated a dense model multiple times to initialize MoE models, and Zhang et al. (2021) proposed MoEfication.…”
Section: Knowledge Integration (mentioning)
Confidence: 99%
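As a rough illustration of that duplication-based initialization (a minimal sketch; `moe_from_dense` is a hypothetical helper, not an API of the cited works, and the expert is assumed to be a standard FFN module), each expert can simply start as a deep copy of the trained dense FFN:

```python
import copy
import torch.nn as nn

def moe_from_dense(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    # Initialize every expert as an identical copy of the trained dense FFN,
    # mirroring the "duplicate the dense model" initialization described above.
    return nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
```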
“…For GPU clusters, the all-to-all operation is too slow to scale the MoE model up. Besides, the gating function includes numerous operations to create token masks, select the top-k experts, and perform a cumulative sum to find the token IDs going to each expert, followed by a sparse matrix multiply (Rajbhandari et al., 2022). All these operations are wasteful due to the sparse tensor representation.…”
Section: Introduction (mentioning)
Confidence: 99%
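Those gating steps can be sketched for the top-1 case as follows (a generic, illustrative PyTorch implementation, not the DeepSpeed-MoE kernels; `capacity` denotes an assumed per-expert token budget): a one-hot token mask is built, the top expert is selected, and a cumulative sum assigns each token a slot in its expert's buffer.

```python
import torch
import torch.nn.functional as F

def top1_gate(logits: torch.Tensor, capacity: int):
    # logits: (num_tokens, num_experts) router scores.
    gates = torch.softmax(logits, dim=-1)
    weight, expert_idx = gates.max(dim=-1)           # top-1 expert per token
    mask = F.one_hot(expert_idx, gates.size(-1))     # token mask, (tokens, experts)
    # Cumulative sum over tokens gives each token its position in its expert's buffer.
    position_in_expert = (torch.cumsum(mask, dim=0) - 1) * mask
    # Drop tokens that exceed the expert's capacity.
    keep = (position_in_expert < capacity) & mask.bool()
    return weight * keep.any(dim=-1), expert_idx, position_in_expert
```

The dense one-hot masks and the cumulative sums over them are exactly the bookkeeping that the quoted passage characterizes as wasteful for what is logically a sparse token-to-expert assignment.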
“…The sparsely-activated training paradigm necessitates new system support. However, existing MoE training systems, including DeepSpeed-MoE [17], Tutel [12], and FastMoE [6], still face limitations in both usability and efficiency. First, they support only part of the mainstream MoE models and gate networks (e.g.…”
Section: Introduction (mentioning)
Confidence: 99%