2019
DOI: 10.48550/arxiv.1904.10509
Preprint

Generating Long Sequences with Sparse Transformers

Cited by 369 publications (622 citation statements)
References 0 publications
“…There has been research revealing that, with certain techniques regularizing the head subspace, multi-head attention can learn desired diverse representations [12,16,18]. Considering that the spatial information becomes abstract after downsampling, we intend to strengthen the spatial representational power of multi-head attention.…”
Section: Large Window Attention
confidence: 99%
“…Besides, efficient transformers have been proposed, which reduce the time complexity of self-attention from quadratic to linear (or log-linear). For example, Linformer (Wang et al., 2020) and Performer (Choromanski et al., 2020) leverage low-rank self-attention; Sparse Transformers (Child et al., 2019) and Big Bird (Zaheer et al., 2020) utilize sparse self-attention; Reformer (Kitaev et al., 2020) introduces learnable attention patterns, and Synthesizer (Tay et al., 2021) introduces randomized attention patterns.…”
Section: Related Work
confidence: 99%
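
The excerpt above attributes the sub-quadratic cost of Sparse Transformers to sparse self-attention. As a rough illustration only (not the paper's block-sparse GPU kernels), the sketch below builds a causal strided attention mask in the spirit of Child et al. (2019) and applies it to ordinary scaled dot-product attention; the NumPy setup, function names, and shapes are our assumptions, and the dense n x n mask is used purely for readability.

```python
import numpy as np

def strided_sparse_mask(seq_len: int, stride: int) -> np.ndarray:
    """Boolean causal mask for a strided sparse attention pattern.

    Position i may attend to position j when j <= i and either
    (a) j lies in the local window of the previous `stride` tokens, or
    (b) i - j is a multiple of `stride` (the strided "column" pattern).
    With stride ~ sqrt(seq_len), each query attends to O(sqrt(n)) keys,
    which is where the sub-quadratic cost comes from.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    local = (i - j) < stride
    strided = (i - j) % stride == 0
    return causal & (local | strided)

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the masked positions.

    For clarity the full n x n score matrix is materialized and masked;
    the actual speedup requires kernels that compute only allowed entries.
    """
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)   # suppress disallowed pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n, d, stride = 64, 16, 8                # stride ~ sqrt(n)
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, n, d))
    out = masked_attention(q, k, v, strided_sparse_mask(n, stride))
    print(out.shape)                        # (64, 16)
```
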
“…Approximated attention methods have been proposed to tackle this problem. Sparse Transformer [17], LogSparse Transformer [18], Longformer [19], and Big Bird [20] use sparse attention mechanisms. Linformer [21] and Synthesizer [22] apply low-rank projection attention.…”
Section: Related Work
confidence: 99%
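
For contrast with the sparse approach above, here is a similarly hedged sketch of the low-rank projection idea these excerpts attribute to Linformer: keys and values are projected down to a fixed shorter length before attention, so the score matrix grows linearly rather than quadratically with sequence length. The random projection matrices and function name are illustrative assumptions; Linformer learns these projections as model parameters.

```python
import numpy as np

def low_rank_attention(q, k, v, proj_len: int, rng=None):
    """Linformer-style low-rank self-attention sketch.

    Keys and values of length n are linearly projected down to a fixed
    length `proj_len` before attention, so the score matrix has shape
    (n, proj_len) instead of (n, n) -- linear rather than quadratic in n.
    The projections E and F are random here for illustration only.
    """
    rng = rng or np.random.default_rng(0)
    n, d = k.shape
    E = rng.standard_normal((proj_len, n)) / np.sqrt(n)   # key projection
    F = rng.standard_normal((proj_len, n)) / np.sqrt(n)   # value projection
    k_low, v_low = E @ k, F @ v                           # (proj_len, d)
    scores = (q @ k_low.T) / np.sqrt(d)                   # (n, proj_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_low                                # (n, d)

if __name__ == "__main__":
    n, d, proj_len = 1024, 64, 128
    rng = np.random.default_rng(1)
    q, k, v = rng.standard_normal((3, n, d))
    print(low_rank_attention(q, k, v, proj_len, rng).shape)  # (1024, 64)
```
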