2022
DOI: 10.1145/3530811

Efficient Transformers: A Survey

Abstract: Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer architecture…

Cited by 498 publications (340 citation statements)
References 20 publications

“…Vanilla transformer relies on the multi-head self-attention mechanism, which scales poorly with the length of the input sequence, requiring quadratic computation time and memory to store all scores that are used to compute the gradients during back-propagation (Qiu et al., 2020). Several Transformer-based models (Kitaev et al., 2020; Tay et al., 2020; Choromanski et al., 2021) have been proposed exploring efficient alternatives that can be used to process long sequences.…”
Section: Sparse-attention Transformers
confidence: 99%
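
The excerpt above describes the quadratic cost of dense self-attention. As a minimal illustration (not code from the survey or from the citing paper), the NumPy sketch below materializes the full n × n score matrix and prints how its float32 footprint grows as the sequence length doubles; the function name and sizes are chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Dense attention: materializes an (n, n) score matrix, so time and
    memory grow quadratically with the sequence length n."""
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5                        # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # (n, d)

# Doubling the sequence length quadruples the score matrix.
for n in (1024, 2048, 4096):
    q = k = v = np.random.randn(n, 64).astype(np.float32)
    out = scaled_dot_product_attention(q, k, v)
    score_mib = n * n * 4 / 2**20                      # float32 bytes for one head's scores
    print(f"{n:5d} tokens -> score matrix takes {score_mib:7.1f} MiB per head")
```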
“…Our work depends heavily on recent advances in efficient Transformers (Tay et al., 2020) that process long sequences (Rae et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Roy et al., 2021). Sparse attention, relative position encoding (Shaw et al., 2018; Raffel et al., 2020; Guo et al., 2021), recurrence mechanism and memory (Dai et al., 2019; Weston et al., 2015), and other tricks (Shen et al., 2020; Katharopoulos et al., 2020; Gupta and Berant, 2020; Stock et al., 2021; Yogatama et al., 2021; Borgeaud et al., 2021; Hawthorne et al., 2022) are commonly adopted by recent Transformer variants to make the operation on long sequences more time/memory efficient.…”
Section: Related Work
confidence: 99%
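
Sparse attention is one of the efficiency tricks listed in the excerpt above. Below is a minimal sketch, assuming a sliding-window (local) pattern in the spirit of models such as Longformer: each position attends only to neighbours within a fixed window, so the number of attended entries grows linearly in the sequence length rather than quadratically. The function name and sizes are illustrative.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean mask in which position i may attend only to positions j
    with |i - j| <= window, keeping O(window) entries per row instead of n."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 8, 2
mask = sliding_window_mask(n, window)
print(mask.astype(int))                                   # banded attention pattern
print("dense entries:", n * n, "| windowed entries:", int(mask.sum()))
```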
“…Here, the input is a (fixed-length) sequence of tokens, which is then fed into multiple layers of self-attention. Lightweight versions such as DistilBERT and others (Tay et al., 2020; Fournier et al., 2021) use fewer parameters but operate on the same type of input. Recently a new family of models emerged (Tolstikhin et al., 2021; Liu et al., 2021a) which also utilize sequence-based input tokens, with an MLP-based, recurrent-free architecture.…”
Section: Introduction
confidence: 99%
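
The MLP-based, recurrent-free family mentioned in the excerpt above (e.g. MLP-Mixer-style models) replaces self-attention with linear maps applied along the token and channel axes. The sketch below is a deliberately skeletal illustration of that idea, not the published architecture: real models add layer normalization, two-layer MLPs with nonlinearities, and patch embeddings. All names here are invented for the example.

```python
import numpy as np

def token_mixing_block(x, w_tokens, w_channels):
    """Attention-free, recurrence-free block: one linear map mixes information
    across token positions, a second mixes across feature channels."""
    x = x + w_tokens @ x          # token mixing: acts along the sequence axis
    x = x + x @ w_channels        # channel mixing: acts along the feature axis
    return x

n_tokens, d_model = 16, 32
x = np.random.randn(n_tokens, d_model)
w_tokens = 0.01 * np.random.randn(n_tokens, n_tokens)
w_channels = 0.01 * np.random.randn(d_model, d_model)
print(token_mixing_block(x, w_tokens, w_channels).shape)  # (16, 32), same shape as the input
```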