2021
DOI: 10.48550/arxiv.2101.03961
Preprint

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Abstract: In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely-activated model, with an outrageous number of parameters but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability. We address these with the Switch Transformer. We simplify th…
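The routing idea summarized in the abstract, sending each token to a single expert so that total parameter count grows with the number of experts while per-token compute stays roughly constant, can be pictured with a minimal sketch. The shapes, the softmax gate, and the dense loop over experts below are illustrative assumptions, not the paper's implementation (which additionally uses load-balancing losses, expert capacity limits, and distributed expert parallelism).

```python
import numpy as np

def switch_layer(tokens, router_w, experts):
    """Minimal top-1 (switch) routing sketch: each token is dispatched to
    exactly one expert, so compute per token stays constant no matter how
    many experts (and therefore parameters) the layer holds."""
    # Router logits: one score per expert for every token.
    logits = tokens @ router_w                       # [n_tokens, n_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    chosen = probs.argmax(axis=-1)                   # top-1 expert per token

    out = np.zeros_like(tokens)
    for e, (w1, w2) in enumerate(experts):
        mask = chosen == e
        if not mask.any():
            continue
        hidden = np.maximum(tokens[mask] @ w1, 0.0)  # expert FFN (ReLU)
        # Scale by the router probability, as in gated MoE layers.
        out[mask] = (hidden @ w2) * probs[mask, e][:, None]
    return out

# Toy usage: 8 tokens, model dim 16, 4 experts with hidden dim 32.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
router_w = rng.normal(size=(16, 4))
experts = [(rng.normal(size=(16, 32)), rng.normal(size=(32, 16))) for _ in range(4)]
print(switch_layer(tokens, router_w, experts).shape)  # (8, 16)
```

Because each token only touches the weights of its single chosen expert, adding experts increases total parameters without increasing the FLOPs spent on any individual token.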

Cited by 180 publications (353 citation statements)
References 30 publications

“…The first step is to initialize the dense student. For most trainable layers (e.g., embedding layer, attention layer, normalization layer), the teacher and the student have the same structure 1 , so we can copy the weights from teachers following Switch Transformer (Fedus et al, 2021). The challenging part is the MoE layer.…”
Section: Approach (mentioning)
confidence: 99%
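The initialization step quoted above can be sketched as follows: every layer whose structure matches between the MoE teacher and the dense student (embeddings, attention, normalization) is copied verbatim, and the MoE layer is the part without a one-to-one counterpart. The parameter names and the expert-averaging fallback in this sketch are hypothetical illustrations, not the cited papers' solution to that challenging part.

```python
import numpy as np

def init_dense_student(teacher):
    """Copy every teacher layer that has an identical counterpart in the
    dense student; the MoE layer has no such counterpart and needs a rule."""
    student = {}
    for name, weight in teacher.items():
        if "expert" in name:
            continue  # MoE experts handled below
        student[name] = weight.copy()

    # Hypothetical fallback: average the experts into the student's single
    # feed-forward block. One simple illustrative choice among many.
    expert_w = [w for n, w in teacher.items() if n.startswith("ffn.expert")]
    if expert_w:
        student["ffn.dense"] = np.mean(expert_w, axis=0)
    return student

# Toy teacher: shared layers plus 4 expert weight matrices of the same shape.
rng = np.random.default_rng(0)
teacher = {"embedding": rng.normal(size=(100, 16)),
           "attention.qkv": rng.normal(size=(16, 48)),
           "layernorm.scale": np.ones(16)}
teacher.update({f"ffn.expert_{i}": rng.normal(size=(16, 64)) for i in range(4)})
student = init_dense_student(teacher)
print(sorted(student))  # shared layers plus a single 'ffn.dense'
```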
“…Baselines As we are the first work, to our best knowledge, focusing on integrating knowledge from a pretrained MoE, the only two existing strong baselines are the knowledge distillation framework proposed in Meta AI MoE (Artetxe et al, 2021) and Switch Transformer (Fedus et al, 2021).…”
Section: Experimental Settings (mentioning)
confidence: 99%