2022
DOI: 10.48550/arxiv.2203.14685
Preprint
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

Abstract: As giant dense models advance quality but require large-scale, expensive GPU clusters for training, the sparsely gated Mixture-of-Experts (MoE), a kind of conditional-computation architecture, has been proposed to scale models while keeping computation constant. Specifically, input data is routed by a gate network and activates only a part of the expert networks. Existing MoE training systems support only some mainstream MoE models (e.g., Top-k) and require expensive high-bandwidth GPU clusters. In this p…
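The gating scheme the abstract describes can be summarized in a few lines. Below is a minimal sketch of sparse Top-k gating, assuming a PyTorch setting; the names (TopKGate, num_experts, k) are illustrative and are not HetuMoE's actual API.

```python
# Sketch of sparse Top-k gating: a gate network scores each token and only the
# k highest-scoring experts are activated for that token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, k: int = 1):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, tokens: torch.Tensor):
        # tokens: (num_tokens, hidden_dim)
        logits = self.gate(tokens)                       # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        # Renormalize the selected experts' weights so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_idx, topk_probs                      # expert ids and combine weights

# Usage: route a batch of 8 tokens to 2 of 16 experts each.
gate = TopKGate(hidden_dim=64, num_experts=16, k=2)
idx, w = gate(torch.randn(8, 64))
print(idx.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```

Each token's hidden state is then dispatched to its selected experts and the expert outputs are combined with the returned weights, which is where the all-to-all communication discussed in the citing work comes in.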

Cited by 2 publications (1 citation statement)
References 9 publications
“…As for communication optimization, DeepSpeed-MoE [21] and HetuMoE [18] implemented a hierarchical all-to-all communication kernel to improve network utilization. Tutel [17] designed adaptive routing techniques coupled with a specific network architecture.…”
Section: Related Work
confidence: 99%
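The hierarchical all-to-all mentioned in this statement exchanges data within a node first (over fast intra-node links) and then across nodes with fewer, larger messages. Below is a hedged sketch of that two-phase idea, not HetuMoE's or DeepSpeed-MoE's actual kernel; the group construction, the gpus_per_node assumption, and the torchrun/NCCL launch are mine.

```python
# Illustrative two-phase ("hierarchical") all-to-all. Assumes
# dist.init_process_group("nccl") has already run under torchrun and that
# gpus_per_node divides the world size.
import torch
import torch.distributed as dist

def build_groups(gpus_per_node: int):
    world, rank = dist.get_world_size(), dist.get_rank()
    nodes = world // gpus_per_node
    intra = inter = None
    for n in range(nodes):                      # one group per node
        g = dist.new_group([n * gpus_per_node + i for i in range(gpus_per_node)])
        if rank // gpus_per_node == n:
            intra = g
    for i in range(gpus_per_node):              # one group per local GPU index
        g = dist.new_group([n * gpus_per_node + i for n in range(nodes)])
        if rank % gpus_per_node == i:
            inter = g
    return intra, inter, nodes

def hierarchical_all_to_all(x, intra, inter, nodes, gpus_per_node):
    """x: (world_size, chunk) - row t holds this rank's data for global rank t."""
    chunk = x.shape[1]
    # Phase 1 (intra-node): regroup rows by destination *local* rank, then
    # all-to-all inside the node so each GPU collects everything that must
    # leave the node through its inter-node peer group.
    x = x.view(nodes, gpus_per_node, chunk).transpose(0, 1).contiguous()
    y = torch.empty_like(x)
    dist.all_to_all_single(y.view(-1, chunk), x.view(-1, chunk), group=intra)
    # Phase 2 (inter-node): regroup by destination node and exchange one large
    # contiguous block per node instead of many small per-GPU messages.
    y = y.transpose(0, 1).contiguous()
    z = torch.empty_like(y)
    dist.all_to_all_single(z.view(-1, chunk), y.view(-1, chunk), group=inter)
    return z.view(-1, chunk)                    # rows ordered by source global rank
```

The end result matches a flat all-to-all, but the inter-node phase sends one large message per node instead of one small message per GPU, which is the network-utilization benefit the citing paper attributes to these systems.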