2020
DOI: 10.48550/arxiv.2007.14062
Preprint

Big Bird: Transformers for Longer Sequences

Abstract: Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
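The abstract describes the sparse pattern only at a high level, so a minimal sketch may help. The snippet below is not the authors' implementation; it simply builds the kind of attention mask the abstract describes (a few global tokens, a sliding window, and a few random keys per query), with all sizes (`window`, `n_global`, `n_random`) chosen purely for illustration.

```python
# Minimal sketch of a BigBird-style sparse attention pattern (illustrative only).
# The dense boolean mask below is for clarity; real implementations keep it in
# a block-sparse layout rather than materializing an n x n array.
import numpy as np

def bigbird_mask(seq_len, window=3, n_global=2, n_random=3, seed=0):
    """Return mask[i, j] == True iff query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Global tokens: the first n_global positions attend everywhere and are
    # attended to by every query (e.g. a [CLS]-like token).
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    for i in range(seq_len):
        # Sliding window: local neighbourhood around position i.
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
        # Random connections: a few keys sampled uniformly for this query.
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True
    return mask

# Apart from the O(1) global rows, each query attends to a constant number of
# keys, so the total number of attended pairs grows linearly with seq_len
# instead of quadratically.
for n in (128, 256, 512):
    m = bigbird_mask(n)
    print(n, m.sum() / n)   # average keys attended per query stays ~constant
```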

Cited by 97 publications (161 citation statements)
References 71 publications (137 reference statements)

Citation statements (ordered by relevance):
“…Yet, as dimension increases, sequence lengths will reach the practical limits of quadratic attention mechanisms. Experimenting with transformers with linear or log-linear attention (Zaheer et al., 2021; Wang et al., 2020a; Vyas et al., 2020) is a natural extension of our work. In terms of asymptotic complexity, matrix inversion (and the other non-linear tasks) is usually handled by O(n^3) algorithms (although O(n^2.37) methods are known).…”
Section: Out-of-domain Generalization and Retraining (mentioning)
confidence: 85%
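For readers skimming the complexity claims in this excerpt, the comparison it makes can be restated compactly (n is the sequence length for attention and the matrix dimension for inversion; this only summarizes the statement above, it adds nothing new):

```latex
\begin{align*}
\text{full self-attention (per layer):} &\quad O(n^2) \\
\text{linear / log-linear attention:}   &\quad O(n) \;\text{or}\; O(n \log n) \\
\text{matrix inversion:}                &\quad O(n^3) \text{ in practice, } O(n^{2.37}) \text{ asymptotically}
\end{align*}
```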
“…Moreover, HiRID introduces a novel high-resolution aspect in ICU data that needs to be correctly taken into account. Thus, as for other sequence data, one possible explanation could be that, when trained with extremely long sequences, models cannot use the extracted features in the most effective way [46]. In the case of Transformers, various kinds of improvements could be made to force the model to learn and extract useful patterns [40].…”
Section: Discussion (mentioning)
confidence: 99%
“…For that reason, Google's BigBird model is selected in this study, as it is one of the most successful long-sequence transformers and supports a sequence length of 4000 tokens. To deal with the limitations that other models face, BigBird uses a sparse attention mechanism that reduces the quadratic dependency to linear [55]. This means it can handle sequences up to 8x longer than what was previously possible using similar hardware.…”
Section: -Way Text Entailment (mentioning)
confidence: 99%
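The "8x longer on similar hardware" claim quoted above follows from keeping the per-query key budget constant. The arithmetic below is a back-of-the-envelope sketch, not taken from either paper; the 512-token baseline and the 64 keys per query are assumed, illustrative values.

```python
# Back-of-the-envelope check of the "8x longer on similar hardware" idea.
# The numbers below (512-token baseline, 64 keys per query) are illustrative
# assumptions, not values reported in the cited papers.
full_len = 512                                # typical full-attention context length
keys_per_query = 64                           # fixed per-query budget under sparse attention

full_scores = full_len ** 2                   # O(n^2) attention score entries
sparse_len = 8 * full_len                     # 4096 tokens
sparse_scores = sparse_len * keys_per_query   # O(n) attention score entries

print(f"full attention,   n={full_len}:  {full_scores:,} score entries")
print(f"sparse attention, n={sparse_len}: {sparse_scores:,} score entries")
# 262,144 entries in both cases: with a fixed per-query budget, an 8x longer
# input needs roughly the same attention memory as the quadratic baseline.
```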