Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2023.emnlp-main.12

Sparse Universal Transformer

Shawn Tan, Yikang Shen, Zhenfang Chen, et al.

Abstract: The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter-sharing also affords it better parameter efficiency than VTs. Despite its many advantages, scaling UT parameters is much more compute and memory intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mix…
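The parameter-sharing idea described in the abstract can be illustrated with a minimal sketch (not from the paper, assuming PyTorch; names are illustrative): a Vanilla Transformer stacks distinct layers, while a Universal Transformer reuses one layer at every depth step, so its parameter count does not grow with depth. The SUT's Sparse Mixture-of-Experts component (truncated in the abstract above) is not shown here.

```python
# Minimal sketch: per-layer parameters (VT) vs. a single shared layer (UT).
# Assumes PyTorch; this is illustrative, not the authors' implementation.
import torch
import torch.nn as nn

d_model, n_heads, n_steps = 64, 4, 6

# Vanilla Transformer: n_steps distinct layers; parameters grow with depth.
vt_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    for _ in range(n_steps)
)

# Universal Transformer: one layer reused at every step; parameter count
# is independent of depth (the parameter efficiency the abstract mentions).
ut_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)  # (batch, sequence, features)

h_vt = x
for layer in vt_layers:      # different weights at each depth
    h_vt = layer(h_vt)

h_ut = x
for _ in range(n_steps):     # the same weights applied repeatedly
    h_ut = ut_layer(h_ut)
```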

Cited by 2 publications
References 21 publications