2017
DOI: 10.48550/arxiv.1701.06538
Preprint

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Abstract: The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
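The abstract above describes a trainable gating network that picks a sparse combination of expert sub-networks for each example. Below is a minimal PyTorch sketch of such a sparsely-gated MoE layer with top-k gating; the names (SparseMoE, num_experts, k) are illustrative, and the paper's noisy gating and load-balancing loss are omitted, so this is a simplified reading rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sketch of a sparsely-gated Mixture-of-Experts layer with top-k gating."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        # Each expert is a small feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Trainable gating network: one logit per expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        logits = self.gate(x)                            # (batch, num_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)  # keep only k experts per example
        weights = F.softmax(top_vals, dim=-1)            # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e             # examples routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Example: route a batch of 16 vectors through the layer.
# layer = SparseMoE(d_model=64, d_hidden=256)
# y = layer(torch.randn(16, 64))
```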

Cited by 246 publications (419 citation statements)
References 24 publications
“…They consider how to scale the network to more parameters at a comparable computation cost, making the model more computation-efficient and better performing. Specifically, they introduce Mixture-of-Experts (MoE) [104] into the MLP-Mixer and propose Sparse-MLP, which shares its name with the work of Tang et al. [91]. In fact, Carlos et al. [105] … Finally, and most ingeniously, Yu et al. [106].…”
Section: Spatial and Channel Projection Blocks
mentioning
confidence: 99%
“…It is not a modification of the network design, so we do not describe it in detail here. More details about MoE can be found in [104, 105].…”
mentioning
confidence: 99%
“…The standard way to learn an ME model is to train a gating network and a set of experts using EM-based methods (Chen et al., 1999; Jordan & Jacobs, 1994; Ng & McLachlan, 2007; Xu et al., 1994; Yang & Ma, 2009). The gating network outputs either experts' weights (Chaer et al., 1997, 1998) or hard labels (Garmash & Monz, 2016; Shazeer et al., 2017). Prior works have studied ME for regular time series data, where ME showed success in allocating experts to the most suitable regions of input (Lu, 2006; Weigend et al., 1995).…”
Section: Related Work
mentioning
confidence: 99%
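The snippet above contrasts gating networks that output soft expert weights with those that output hard labels. A small PyTorch sketch of the two styles, with illustrative function names (soft_gate, hard_gate) and assuming the per-expert outputs have already been computed:

```python
import torch
import torch.nn.functional as F

def soft_gate(gate_logits: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    # gate_logits: (batch, num_experts); expert_outputs: (num_experts, batch, d)
    w = F.softmax(gate_logits, dim=-1)                  # dense weights over all experts
    return torch.einsum('be,ebd->bd', w, expert_outputs)

def hard_gate(gate_logits: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    idx = gate_logits.argmax(dim=-1)                    # one "hard label" per example
    batch = torch.arange(idx.shape[0])
    return expert_outputs[idx, batch]                   # each example takes a single expert's output
```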
“…The goal is to combine the effectiveness of the sparse MoE model and the usability of the dense model (Shazeer et al., 2017): …”
Section: Introduction
mentioning
confidence: 99%