2021
DOI: 10.48550/arxiv.2104.09864
Preprint

RoFormer: Enhanced Transformer with Rotary Position Embedding

Cited by 109 publications (131 citation statements)
References 15 publications

“…We replace relative attention in vanilla layers by LSH attention (Kitaev, Kaiser, and Levskaya 2020), which allows us to handle 12288-long sequences. To achieve relative attention parametrization, the LSH attention is combined with rotary positional embeddings (Su et al 2021) layers. In this setup, we reach a score of 3.443 bpd with a (3@1, 12@3, 3@1) architecture which has a total linear cost of 10.…”
Section: ImageNet64
confidence: 99%
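The excerpt above pairs LSH attention with rotary embeddings because the rotary scheme acts on the queries and keys before any attention kernel is applied. As a minimal illustration (NumPy; the function and variable names are ours, not from the citing paper or the RoFormer code), the rotation can be written as:

```python
import numpy as np

def rotary_embedding(x, positions, base=10000.0):
    """Apply rotary position embedding to vectors x of shape (seq_len, dim).

    Each channel pair (2i, 2i+1) is rotated by angle pos * base^(-2i/dim),
    so position information enters only through rotations of q and k.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)   # per-pair frequencies
    angles = positions[:, None] * inv_freq[None, :]      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                      # even / odd channels
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Usage: rotate q and k, then hand them to any attention kernel
# (softmax, LSH buckets, linear attention, ...).
seq_len, dim = 8, 16
rng = np.random.default_rng(0)
q = rotary_embedding(rng.normal(size=(seq_len, dim)), np.arange(seq_len))
k = rotary_embedding(rng.normal(size=(seq_len, dim)), np.arange(seq_len))
scores = q @ k.T / np.sqrt(dim)   # relative-position-aware attention logits
```

Because the position information is baked into q and k themselves, the same rotated vectors can be routed into exact softmax attention, LSH-bucketed attention, or any other kernel.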
“…Instead of the segment-level recurrence mechanism proposed in that paper, we use shortening to make our model more efficient and feasible to train on longer sequences. Another recently proposed relative attention parametrization is RoFormer (Su et al 2021) where rotary positional embeddings are introduced. We find this work particularly relevant because rotary positional embeddings are compatible with any attention type including efficient attention and can be combined with our model (Section 3.3).…”
Section: Related Work
confidence: 99%
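The compatibility claim rests on a property of the rotation itself: the inner product of a rotated query and a rotated key depends only on their relative offset. A small numerical check (NumPy, illustrative only; the helper mirrors the sketch above for single vectors):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary embedding for a single vector x at scalar position pos."""
    half = x.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) * 2.0 / x.shape[-1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# <q rotated to position m, k rotated to position n> depends only on m - n:
a = rope(q, 5) @ rope(k, 2)       # positions (5, 2),    offset 3
b = rope(q, 105) @ rope(k, 102)   # positions (105, 102), offset 3
print(np.allclose(a, b))          # True: the score is a function of the offset
```

This is why the parametrization composes cleanly with shortening, efficient attention, or any other mechanism that still computes query-key interactions.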
“…We use the Rotary positional encoding proposed in [61] and extend it to the 3D case. Given a 3D point S_i = (x, y, z) ∈ ℝ³ and its feature…”
Section: Relative 3D Positional Encoding
confidence: 99%
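The quotation is truncated before the construction, so the exact 3D formulation is not shown here. One natural way to extend rotary encodings to 3D coordinates, offered purely as a hypothetical sketch (NumPy; the channel split, base, and function names are our assumptions, not necessarily the citing paper's scheme), is to partition the feature channels into three groups and rotate each group by one coordinate:

```python
import numpy as np

def rope_1d(x, coord, base=100.0):
    """Rotate channel pairs of x (..., d) by angles coord * base^(-2i/d)."""
    d = x.shape[-1]
    half = d // 2
    ang = coord[..., None] * base ** (-np.arange(half) * 2.0 / d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(feat, points):
    """Apply rotary encoding per axis: channels are split into three equal
    groups, and group j is rotated by the j-th coordinate of each point.

    feat:   (n_points, dim) point features, dim divisible by 6
    points: (n_points, 3)   xyz coordinates
    """
    n, dim = feat.shape
    g = dim // 3
    parts = [rope_1d(feat[:, j * g:(j + 1) * g], points[:, j]) for j in range(3)]
    return np.concatenate(parts, axis=-1)

# Example: features for 4 points in R^3
rng = np.random.default_rng(0)
feat = rope_3d(rng.normal(size=(4, 12)), rng.uniform(size=(4, 3)))
```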
“…• DEIT-LA, a DEIT model equipped with linearized attention ( §2.2) instead of softmax attention. We also include several variants that improve DEIT-LA, such as PERMUTEFORMER (Chen, 2021), SPE (Liutkus et al, 2021) and Rotary positional embeddings (ROPE, Su et al, 2021) that incorporates relative positional encodings.…”
Section: Algorithm 1 Dynamic Programming for Ripple Attention
confidence: 99%
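For the linearized-attention variants listed above, the RoFormer paper suggests applying the rotation to the feature-mapped queries and keys in the numerator while leaving the normalizer unrotated. A rough sketch of that recipe (NumPy; the elu+1 feature map and all names are illustrative choices, not the exact DEIT-LA setup):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary embedding for x of shape (seq_len, dim)."""
    d = x.shape[-1]
    half = d // 2
    ang = positions[:, None] * base ** (-np.arange(half) * 2.0 / d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def linear_attention_with_rope(q, k, v, positions, eps=1e-6):
    """Linear (kernel-feature) attention with rotary positions.

    Feature map phi(x) = elu(x) + 1 keeps features positive; the rotation is
    applied to phi(q) and phi(k) in the numerator only, so the normalizer
    stays position-free (one option discussed in the RoFormer paper).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1
    qf, kf = phi(q), phi(k)
    qr, kr = rope(qf, positions), rope(kf, positions)
    num = qr @ (kr.T @ v)                      # (seq, dim_v), no n^2 matrix
    den = qf @ kf.sum(axis=0)[:, None] + eps   # (seq, 1) normalizer, unrotated
    return num / den

rng = np.random.default_rng(0)
seq, d = 8, 16
q, k, v = (rng.normal(size=(seq, d)) for _ in range(3))
out = linear_attention_with_rope(q, k, v, np.arange(seq, dtype=float))
```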