Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1032

Adaptive Attention Span in Transformers

Abstract: We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in the Transformer while maintaining control over its memory footprint and computational time. We show the effectiveness of our approach on the task of character-level language modeling, where we achieve state-of-the-art performance on text8 and enwiki8 using a maximum context of 8k characters.
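As a reading aid, the sketch below (PyTorch, a simplified reimplementation rather than the authors' released code) illustrates the kind of soft span mask the abstract describes: each attention head learns a span parameter z, and attention weights beyond roughly z past tokens are smoothly driven to zero before renormalization. Names such as `AdaptiveSpanMask`, `ramp`, and `max_span` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpanMask(nn.Module):
    """Soft attention-span mask (simplified sketch, not the released implementation)."""

    def __init__(self, max_span: int, ramp: int = 32, init_span: float = 0.0):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp  # R: width of the linear ramp from 1 down to 0
        # Learnable span z (a single scalar here; the full model learns one per head).
        self.z = nn.Parameter(torch.tensor(init_span))

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (..., key_len) raw scores over past positions,
        # with the most recent key last.
        key_len = attn_scores.size(-1)
        # Distance of each key from the current query: key_len-1, ..., 1, 0.
        distance = torch.arange(
            key_len - 1, -1, -1, dtype=attn_scores.dtype, device=attn_scores.device
        )
        span = self.z.clamp(0, self.max_span)
        # m(x) = clamp((R + z - x) / R, 0, 1): 1 within the span, then a linear
        # ramp of width R down to 0, keeping the span differentiable in z.
        mask = ((self.ramp + span - distance) / self.ramp).clamp(0, 1)
        # Equivalent to m(x) * exp(score) / sum(m * exp(score)).
        weights = F.softmax(attn_scores, dim=-1) * mask
        return weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)

# Usage (shapes are illustrative):
# scores = torch.randn(2, 4, 512)                      # (batch, queries, keys)
# weights = AdaptiveSpanMask(max_span=8192)(scores)    # rows sum to 1 over the kept span
```

In the paper, an L1 penalty on the learned spans is added to the training loss so that each head keeps only as much context as it needs, which is what allows the maximum context to grow to 8k characters without a proportional cost in memory and compute.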

Cited by 193 publications (117 citation statements) | References 5 publications

Citation statements (ordered by relevance):
“…Similarly, others raise the question whether 16 attention heads are really necessary to obtain competitive performance. Finally, several recent works address the computational challenge of modeling very long sequences and modify the Transformer architecture with attention operations that reduce time complexity (Shen et al., 2018; Sukhbaatar et al., 2019; Dai et al., 2019; Indurthi et al., 2019; Kitaev et al., 2020).…”
Section: Introduction
Confidence: 99%
“…Recent works have proposed sparse Transformers and adaptive span Transformers (Sukhbaatar et al., 2019). However, the "sparsity" of those models only limits the attention to a contiguous span of past tokens, while in this work we propose a highly adaptive Transformer model that is capable of attending to a sparse set of words that are not necessarily contiguous.…”
Section: Introduction
Confidence: 99%
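To make the contrast drawn in the statement above concrete, the toy NumPy snippet below (purely illustrative, taken from neither paper) compares a contiguous span mask, which keeps only the most recent k positions, with a non-contiguous sparse mask that keeps an arbitrary subset of positions; k = 4 and the selected indices are made-up values.

```python
import numpy as np

key_len = 10
scores = np.random.randn(key_len)  # raw attention scores over past positions

# Contiguous span: keep only the last k = 4 keys (adaptive-span style).
span_mask = np.zeros(key_len)
span_mask[-4:] = 1.0

# Non-contiguous sparse selection: keep arbitrary positions (e.g. 1, 4, 8).
sparse_mask = np.zeros(key_len)
sparse_mask[[1, 4, 8]] = 1.0

def masked_softmax(s, m):
    # Softmax restricted to the positions the mask keeps.
    w = np.exp(s - s.max()) * m
    return w / w.sum()

print(masked_softmax(scores, span_mask))    # weight only on the most recent block
print(masked_softmax(scores, sparse_mask))  # weight on scattered positions
```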
“…Since the Transformer has become a promising model for diverse NLP tasks, there have been attempts to improve its architectural efficiency along two main approaches. The first is to restrict dependencies between input tokens to reduce superfluous pair-wise calculations (Guo et al., 2019b; Sukhbaatar et al., 2019a). The approach provides time efficiency during inference, but it does not address the heavy parameterization of the Transformer.…”
Section: Towards a Lightweight Transformer
Confidence: 99%
“…We compare the Group-Transformer against existing character-level language models using fewer than 50M parameters in Table 4. Although the number of embedding vectors for characters is much lower than for word-level vocabularies (Sukhbaatar et al., 2019a), most previous models have had more than 10M parameters. Recently reported Transformer models have achieved under 1.2 bpc on enwik8 and text8, but models with under 10M parameters have not been well explored.…”
Section: Comparison Against Prior Character-level Language Models
Confidence: 99%