2017
DOI: 10.48550/arxiv.1701.06538
Preprint

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Abstract: The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
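The abstract above describes a trainable gating network that picks a sparse combination of expert sub-networks for each example. Below is a minimal PyTorch sketch of such a sparsely-gated MoE layer with top-k gating; the names (SparseMoE, num_experts, k) are illustrative, and the paper's noisy gating and load-balancing loss are omitted, so this is a simplified reading rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sketch of a sparsely-gated Mixture-of-Experts layer with top-k gating."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        # Each expert is a small feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Trainable gating network: one logit per expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        logits = self.gate(x)                            # (batch, num_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)  # keep only k experts per example
        weights = F.softmax(top_vals, dim=-1)            # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e             # examples routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Example: route a batch of 16 vectors through the layer.
# layer = SparseMoE(d_model=64, d_hidden=256)
# y = layer(torch.randn(16, 64))
```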

Cited by 246 publications (419 citation statements)
References 24 publications
“…They consider how to scale the network to more parameters at a comparable computation cost, making the model more computation-efficient and better performing. Specifically, they introduce Mixture-of-Experts (MoE) [104] into the MLP-Mixer and propose Sparse-MLP, which shares its name with the work of Tang et al. [91]. In fact, Carlos et al. [105] … Finally, and most ingeniously, Yu et al. [106].…”
Section: Spatial and Channel Projection Blocks
mentioning
confidence: 99%
“…It is not a modification of the network design, so we do not describe it in detail here. More details about MoE can be found in [104, 105].…”
mentioning
confidence: 99%
“…The standard way to learn an ME model is to train a gating network and a set of experts using EM-based methods (Chen et al., 1999; Jordan & Jacobs, 1994; Ng & McLachlan, 2007; Xu et al., 1994; Yang & Ma, 2009). The gating network outputs either experts' weights (Chaer et al., 1997, 1998) or hard labels (Garmash & Monz, 2016; Shazeer et al., 2017). Prior works have studied ME for regular time series data, where ME showed success in allocating experts to the most suitable regions of input (Lu, 2006; Weigend et al., 1995).…”
Section: Related Work
mentioning
confidence: 99%
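The snippet above contrasts gating networks that output soft expert weights with those that output hard labels. A small PyTorch sketch of the two styles, with illustrative function names (soft_gate, hard_gate) and assuming the per-expert outputs have already been computed:

```python
import torch
import torch.nn.functional as F

def soft_gate(gate_logits: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    # gate_logits: (batch, num_experts); expert_outputs: (num_experts, batch, d)
    w = F.softmax(gate_logits, dim=-1)                  # dense weights over all experts
    return torch.einsum('be,ebd->bd', w, expert_outputs)

def hard_gate(gate_logits: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    idx = gate_logits.argmax(dim=-1)                    # one "hard label" per example
    batch = torch.arange(idx.shape[0])
    return expert_outputs[idx, batch]                   # each example takes a single expert's output
```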
“…The goal is to combine the effectiveness of the sparse MoE model and the usability of the dense model (Shazeer et al., 2017): …”
Section: Introduction
mentioning
confidence: 99%