2021
DOI: 10.48550/arxiv.2101.03961
Preprint

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Abstract: In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely-activated model, with an outrageous number of parameters but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability. We address these with the Switch Transformer. We simplify th…
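The routing idea summarized in the abstract, sending each token to a single expert so that total parameter count grows with the number of experts while per-token compute stays roughly constant, can be pictured with a minimal sketch. The shapes, the softmax gate, and the dense loop over experts below are illustrative assumptions, not the paper's implementation (which additionally uses load-balancing losses, expert capacity limits, and distributed expert parallelism).

```python
import numpy as np

def switch_layer(tokens, router_w, experts):
    """Minimal top-1 (switch) routing sketch: each token is dispatched to
    exactly one expert, so compute per token stays constant no matter how
    many experts (and therefore parameters) the layer holds."""
    # Router logits: one score per expert for every token.
    logits = tokens @ router_w                       # [n_tokens, n_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    chosen = probs.argmax(axis=-1)                   # top-1 expert per token

    out = np.zeros_like(tokens)
    for e, (w1, w2) in enumerate(experts):
        mask = chosen == e
        if not mask.any():
            continue
        hidden = np.maximum(tokens[mask] @ w1, 0.0)  # expert FFN (ReLU)
        # Scale by the router probability, as in gated MoE layers.
        out[mask] = (hidden @ w2) * probs[mask, e][:, None]
    return out

# Toy usage: 8 tokens, model dim 16, 4 experts with hidden dim 32.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
router_w = rng.normal(size=(16, 4))
experts = [(rng.normal(size=(16, 32)), rng.normal(size=(32, 16))) for _ in range(4)]
print(switch_layer(tokens, router_w, experts).shape)  # (8, 16)
```

Because each token only touches the weights of its single chosen expert, adding experts increases total parameters without increasing the FLOPs spent on any individual token.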

Cited by 180 publications (353 citation statements)
References 30 publications

“…The first step is to initialize the dense student. For most trainable layers (e.g., embedding layer, attention layer, normalization layer), the teacher and the student have the same structure 1 , so we can copy the weights from teachers following Switch Transformer (Fedus et al, 2021). The challenging part is the MoE layer.…”
Section: Approach (mentioning)
confidence: 99%
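The initialization step quoted above can be sketched as follows: every layer whose structure matches between the MoE teacher and the dense student (embeddings, attention, normalization) is copied verbatim, and the MoE layer is the part without a one-to-one counterpart. The parameter names and the expert-averaging fallback in this sketch are hypothetical illustrations, not the cited papers' solution to that challenging part.

```python
import numpy as np

def init_dense_student(teacher):
    """Copy every teacher layer that has an identical counterpart in the
    dense student; the MoE layer has no such counterpart and needs a rule."""
    student = {}
    for name, weight in teacher.items():
        if "expert" in name:
            continue  # MoE experts handled below
        student[name] = weight.copy()

    # Hypothetical fallback: average the experts into the student's single
    # feed-forward block. One simple illustrative choice among many.
    expert_w = [w for n, w in teacher.items() if n.startswith("ffn.expert")]
    if expert_w:
        student["ffn.dense"] = np.mean(expert_w, axis=0)
    return student

# Toy teacher: shared layers plus 4 expert weight matrices of the same shape.
rng = np.random.default_rng(0)
teacher = {"embedding": rng.normal(size=(100, 16)),
           "attention.qkv": rng.normal(size=(16, 48)),
           "layernorm.scale": np.ones(16)}
teacher.update({f"ffn.expert_{i}": rng.normal(size=(16, 64)) for i in range(4)})
student = init_dense_student(teacher)
print(sorted(student))  # shared layers plus a single 'ffn.dense'
```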
“…Baselines As we are the first work, to our best knowledge, focusing on integrating knowledge from a pretrained MoE, the only two existing strong baselines are the knowledge distillation framework proposed in Meta AI MoE (Artetxe et al, 2021) and Switch Transformer (Fedus et al, 2021).…”
Section: Experimental Settings (mentioning)
confidence: 99%