Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1032

Adaptive Attention Span in Transformers

Abstract: We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in the Transformer while maintaining control over its memory footprint and computational time. We show the effectiveness of our approach on the task of character-level language modeling, where we achieve state-of-the-art performance on text8 and enwiki8 using a maximum context of 8k characters.
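As a reading aid, the sketch below (PyTorch, a simplified reimplementation rather than the authors' released code) illustrates the kind of soft span mask the abstract describes: each attention head learns a span parameter z, and attention weights beyond roughly z past tokens are smoothly driven to zero before renormalization. Names such as `AdaptiveSpanMask`, `ramp`, and `max_span` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpanMask(nn.Module):
    """Soft attention-span mask (simplified sketch, not the released implementation)."""

    def __init__(self, max_span: int, ramp: int = 32, init_span: float = 0.0):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp  # R: width of the linear ramp from 1 down to 0
        # Learnable span z (a single scalar here; the full model learns one per head).
        self.z = nn.Parameter(torch.tensor(init_span))

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (..., key_len) raw scores over past positions,
        # with the most recent key last.
        key_len = attn_scores.size(-1)
        # Distance of each key from the current query: key_len-1, ..., 1, 0.
        distance = torch.arange(
            key_len - 1, -1, -1, dtype=attn_scores.dtype, device=attn_scores.device
        )
        span = self.z.clamp(0, self.max_span)
        # m(x) = clamp((R + z - x) / R, 0, 1): 1 within the span, then a linear
        # ramp of width R down to 0, keeping the span differentiable in z.
        mask = ((self.ramp + span - distance) / self.ramp).clamp(0, 1)
        # Equivalent to m(x) * exp(score) / sum(m * exp(score)).
        weights = F.softmax(attn_scores, dim=-1) * mask
        return weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)

# Usage (shapes are illustrative):
# scores = torch.randn(2, 4, 512)                      # (batch, queries, keys)
# weights = AdaptiveSpanMask(max_span=8192)(scores)    # rows sum to 1 over the kept span
```

In the paper, an L1 penalty on the learned spans is added to the training loss so that each head keeps only as much context as it needs, which is what allows the maximum context to grow to 8k characters without a proportional cost in memory and compute.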

Cited by 193 publications (117 citation statements) | References 5 publications

Citation statements (ordered by relevance):
“…Similarly, others raise the question whether 16 attention heads are really necessary to obtain competitive performance. Finally, several recent works address the computational challenge of modeling very long sequences and modify the Transformer architecture with attention operations that reduce time complexity (Shen et al., 2018; Sukhbaatar et al., 2019; Dai et al., 2019; Indurthi et al., 2019; Kitaev et al., 2020).…”
Section: Introduction
Confidence: 99%
“…Recent works have proposed sparse Transformers and adaptive span Transformers (Sukhbaatar et al., 2019). However, the "sparsity" of those models only limits the attention to a contiguous span of past tokens, while in this work we propose a highly adaptive Transformer model that is capable of attending to a sparse set of words that are not necessarily contiguous.…”
Section: Introduction
Confidence: 99%
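To make the contrast drawn in the statement above concrete, the toy NumPy snippet below (purely illustrative, taken from neither paper) compares a contiguous span mask, which keeps only the most recent k positions, with a non-contiguous sparse mask that keeps an arbitrary subset of positions; k = 4 and the selected indices are made-up values.

```python
import numpy as np

key_len = 10
scores = np.random.randn(key_len)  # raw attention scores over past positions

# Contiguous span: keep only the last k = 4 keys (adaptive-span style).
span_mask = np.zeros(key_len)
span_mask[-4:] = 1.0

# Non-contiguous sparse selection: keep arbitrary positions (e.g. 1, 4, 8).
sparse_mask = np.zeros(key_len)
sparse_mask[[1, 4, 8]] = 1.0

def masked_softmax(s, m):
    # Softmax restricted to the positions the mask keeps.
    w = np.exp(s - s.max()) * m
    return w / w.sum()

print(masked_softmax(scores, span_mask))    # weight only on the most recent block
print(masked_softmax(scores, sparse_mask))  # weight on scattered positions
```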
“…Since the Transformer has become a promising model for diverse NLP tasks, there have been attempts to improve its architectural efficiency along two main approaches. The first is to restrict dependencies between input tokens to reduce superfluous pair-wise calculations (Guo et al., 2019b; Sukhbaatar et al., 2019a). The approach provides time efficiency during inference, but it does not address the heavy parameterization of the Transformer.…”
Section: Towards a Lightweight Transformer
Confidence: 99%
“…We compare the Group-Transformer against existing character-level language models using fewer than 50M parameters in Table 4. Although the number of embedding vectors for characters is much lower than for word-level vocabularies (Sukhbaatar et al., 2019a), most previous models have had more than 10M parameters. Recently reported Transformer models have achieved under 1.2 bpc on enwik8 and text8, but models with under 10M parameters have not been well explored.…”
Section: Comparison Against Prior Character-level Language Models
Confidence: 99%