2019
DOI: 10.48550/arxiv.1905.07799
Preprint

Adaptive Attention Span in Transformers

Cited by 34 publications (52 citation statements). References 0 publications.
“…Models based on Transformers (Vaswani et al., 2017), such as BERT (Devlin et al., 2018) or variants (Yang et al., 2019; Lan et al., 2019; Raffel et al., 2019), yield state-of-the-art results in many NLP tasks such as language modeling (Child et al., 2019; Sukhbaatar et al., 2019; Rae et al., 2019; Kitaev et al., 2020), question answering (Lan et al., 2019; Zaheer et al., 2020; Beltagy et al., 2020), and summarization (Zhang et al., 2019). However, existing studies show that they do not have good compositional generalization.…”
Section: Transformer Models (mentioning; confidence: 99%)
“…Instead of architectural strategies, several approaches have been proposed to directly address the O(N²) problem of the self-attention mechanism. They can be grouped into several categories: special patterns [20, 5, 33], exploiting the associative law [7, 31], low-rank factorization methods [40], linear approximation by sampling important tokens [24, 46], and using cross-covariance matrices instead of Gram matrices [12]. Although the detailed methods are quite different, our UFO scheme is mainly related to utilizing the associative law.…”
Section: Related Work (mentioning; confidence: 99%)
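
The statement above groups efficient-attention methods by strategy and notes that the citing work's UFO scheme builds on the associative law. Below is a minimal sketch of that general idea, linearizing attention by re-associating the matrix products; the feature map phi and all names here are illustrative assumptions, not the UFO method or any specific cited approach.

import numpy as np

# Standard attention computes softmax(Q K^T) V, which costs O(N^2) in the
# sequence length N.  Replacing the softmax with a kernel feature map phi lets
# the product be re-associated as phi(Q) (phi(K)^T V), costing O(N d^2) instead.

def phi(x):
    # One possible positive feature map (an assumption, not prescribed by the
    # citing paper): elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(Q, K, V):
    # O(N^2): materializes the full N x N attention matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # O(N d^2): the associative law, (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V).
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # d x d_v summary of keys and values
    Z = Qp @ Kp.sum(axis=0)          # per-query normalizer
    return (Qp @ KV) / Z[:, None]

N, d = 128, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(quadratic_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)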
“…Cordonnier et al. tried to address the redundancy issue with a modified multi-head mechanism [38]. Sukhbaatar et al. proposed a modified Transformer with an adaptive attention span [39].…”
Section: Local Multi-head (mentioning; confidence: 99%)
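
Since this statement points back to the adaptive attention span of the indexed paper (Sukhbaatar et al., 2019), here is a minimal sketch of the soft masking idea behind it: each head learns a span z, and attention weights for keys at distance x from the query are scaled by a ramp that decays from 1 to 0 over a width R beyond z. The function below follows that formulation; the concrete values of z and R are illustrative choices.

import numpy as np

def span_mask(x, z, R=32.0):
    # Soft mask m_z(x) = clamp((R + z - x) / R, 0, 1): equal to 1 within the
    # learned span z, then linearly decaying to 0 over the next R positions.
    return np.clip((R + z - x) / R, 0.0, 1.0)

# With a span z = 100 and ramp R = 32, keys within 100 positions of the query
# are fully attended, and keys beyond 132 positions are masked out entirely.
distances = np.array([0.0, 50.0, 100.0, 116.0, 132.0, 200.0])
print(span_mask(distances, z=100.0))  # -> [1.  1.  1.  0.5 0.  0. ]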