2021
DOI: 10.48550/arxiv.2103.02143
Preprint

Random Feature Attention

Abstract: Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can…
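To make the kernel-approximation idea concrete, below is a minimal NumPy sketch of the trick the abstract describes: random Fourier features approximate a Gaussian kernel, and (for unit-norm queries and keys) softmax attention becomes a ratio of two quantities that can be accumulated over the keys in linear time and space. This is an illustrative sketch, not the authors' implementation; the function names, the feature count, and the unit-norm assumption on queries and keys are choices made here for clarity.

```python
import numpy as np

def random_fourier_features(x, W):
    """Map x (..., d) to random Fourier features (..., 2D) such that
    phi(x)·phi(y) ≈ exp(-||x - y||^2 / 2).  W has shape (D, d) with
    rows drawn from N(0, I)."""
    proj = x @ W.T                                   # (..., D)
    D = W.shape[0]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(D)

def rfa_attention(Q, K, V, num_features=256, seed=0):
    """Linear-time approximation of softmax attention via random features.
    Q, K: (n, d), assumed L2-normalized so the exp(||.||^2 / 2) scaling
    factors cancel in the softmax ratio.  V: (n, d_v)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, Q.shape[-1]))
    phi_q = random_fourier_features(Q, W)            # (n, 2D)
    phi_k = random_fourier_features(K, W)            # (n, 2D)
    # Key/value summaries, computed once in O(n) and reused for every query.
    S = phi_k.T @ V                                  # (2D, d_v)
    z = phi_k.sum(axis=0)                            # (2D,)
    return (phi_q @ S) / (phi_q @ z)[:, None]        # (n, d_v)

# Usage: compare against exact softmax attention on unit-norm inputs.
n, d = 64, 32
rng = np.random.default_rng(1)
Q = rng.standard_normal((n, d)); Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
K = rng.standard_normal((n, d)); K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.standard_normal((n, d))
exact = np.exp(Q @ K.T); exact /= exact.sum(-1, keepdims=True)
print(np.abs(rfa_attention(Q, K, V) - exact @ V).mean())
```

The linear cost comes from accumulating the key/value summaries S and z once and reusing them for every query, instead of materializing the n × n attention matrix; the approximation sharpens as the number of random features grows.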

Cited by 46 publications (30 citation statements)
References 40 publications

Citation statements (ordered by relevance):
“…Linformer [21] and Synthesizer [22] apply low-rank projection attention. Performer [23], Linear Transformer [24], and Random Feature Attention [25] rely on kernel approximation. Reformer [26], Routing Transformer [27], and Sinkhorn Transformer [28] follow the paradigm of re-arranging sequences.…”
Section: Related Work (mentioning; confidence: 99%)
“…The suite consists of different data types, such as images and texts. Many Transformers have been evaluated on the suite [25,50,51,52]. We compared our CDIL-CNN with other models on the following datasets:…”
Section: Long Range Arena Benchmark (mentioning; confidence: 99%)
“…This limits the capability of Transformers, e.g., learning fine-grained feature representation as required in many visual recognition tasks. Linear Transformers: Recently, there have been a number of linear/efficient variants [5,34,18,19,30,24,17] of Transformers in NLP. For example, [34] learns to shrink the length of Key and Value based on a low-rank assumption.…”
Section: Related Work (mentioning; confidence: 99%)
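To illustrate the low-rank idea attributed to [34] in this excerpt (shrinking the sequence length of Key and Value, as in Linformer-style methods), here is a minimal NumPy sketch; the function name and the projection matrices E and F are assumptions made for illustration, not code from any cited paper.

```python
import numpy as np

def low_rank_attention(Q, K, V, E, F):
    """Sketch of attention with a low-rank sequence projection:
    K and V are projected from sequence length n down to k << n,
    so the attention map is (n, k) rather than (n, n).
    Q, K, V: (n, d); E, F: (k, n) learned projection matrices."""
    K_proj = E @ K                                   # (k, d)
    V_proj = F @ V                                   # (k, d)
    scores = (Q @ K_proj.T) / np.sqrt(Q.shape[-1])   # (n, k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj                          # (n, d)
```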
“…The parameters of both models are learned jointly to maximize the likelihood of the target sentences given the corresponding source sentences from a parallel corpus. More robust and expressive neural MT systems have also been developed (Guo et al., 2020; Zhu et al., 2020; Kasai et al., 2021a,b; Lioutas and Guo, 2020; Peng et al., 2021; Tay et al., 2021; Nguyen and Salazar, 2019; Wang et al., 2019; Xiong et al., 2020b) based on the attention mechanism (Luong et al., 2015).…”
Section: Related Work (mentioning; confidence: 99%)
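For reference, the maximum-likelihood training criterion described in this excerpt is the standard sequence-to-sequence objective; the notation below (θ, x, y, D) is generic and not taken from any particular cited paper.

```latex
\theta^{\star}
  = \arg\max_{\theta} \sum_{(x,\, y) \in \mathcal{D}} \log p_{\theta}(y \mid x)
  = \arg\max_{\theta} \sum_{(x,\, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_{\theta}\bigl(y_t \mid y_{<t},\, x\bigr),
```

where \mathcal{D} is the parallel corpus and y_{<t} is the target prefix preceding position t.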