Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.298

Improve Transformer Models with Better Relative Position Embeddings

Abstract: Transformer architectures rely on explicit position encodings in order to preserve a notion of word order. In this paper, we argue that existing work does not fully utilize position information. For example, the initial proposal of a sinusoid embedding is fixed and not learnable. In this paper, we first review absolute position embeddings and existing methods for relative position embeddings. We then propose new techniques that encourage increased interaction between query, key and relative position embeddings…
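The proposal summarized in the abstract centers on letting the query, key, and relative position embeddings all interact inside the attention logits. Below is a minimal PyTorch sketch of one such formulation (query-key, query-relative, and key-relative dot products summed before the softmax); the class name, single-head layout, shapes, and the ±8 clipping window are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: self-attention whose logits combine query-key, query-relative,
# and key-relative terms, in the spirit of increased interaction between query,
# key, and relative position embeddings. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 8):
        super().__init__()
        self.d = d_model
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One embedding per clipped relative offset in [-max_rel_dist, max_rel_dist].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x):                                   # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Relative offsets j - i, clipped to the window and shifted to be >= 0.
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        r = self.rel_emb(rel + self.max_rel_dist)            # (T, T, d_model)

        content = torch.einsum("bid,bjd->bij", q, k)         # query-key term
        q_rel   = torch.einsum("bid,ijd->bij", q, r)         # query-relative term
        k_rel   = torch.einsum("bjd,ijd->bij", k, r)         # key-relative term

        logits = (content + q_rel + k_rel) / self.d ** 0.5
        attn = F.softmax(logits, dim=-1)
        return torch.einsum("bij,bjd->bid", attn, v)
```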

Cited by 60 publications (44 citation statements) | References 10 publications

Citation statements (ordered by relevance):
“…When we generated an additional lightweight convolution based on keys, the model performed worse than composite attention alone (GLUE 74.0 compared to 75.2). This result clarifies the findings of Huang et al. (2020), who reported only small improvements from query and key-based relative position embeddings for a subset of the GLUE tasks. Grammaticality judgments were particularly sensitive to position information.…”
Section: Composite Attention Performed the Best
Citation type: supporting
confidence: 88%
“…All of our experiments used a convolution kernel size of 17, or eight positions in each direction, a mid-range value that has been found to work well for both relative positions and convolution in language models (Huang et al., 2020; Jiang et al., 2020; Shaw et al., 2018). As in Shaw et al. (2018), relative embeddings W^C_{j-i} shared weights across heads.…”
Section: Dynamic Convolution (Relative Embeddings)
Citation type: mentioning
confidence: 99%
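For readers unfamiliar with the clipping scheme in the quote above, here is a small, hypothetical PyTorch sketch of how a window of eight positions in each direction yields 2*8 + 1 = 17 distinct relative offsets, and how a single embedding table for those offsets can be shared across attention heads; variable names and sizes are illustrative, not taken from the cited papers.

```python
# Sketch: clipped relative offsets and one embedding table shared by all heads.
import torch
import torch.nn as nn

MAX_DIST = 8                                   # eight positions each direction
NUM_OFFSETS = 2 * MAX_DIST + 1                 # "kernel size" 17

def clipped_relative_positions(seq_len: int) -> torch.Tensor:
    """Matrix of relative offsets j - i, clipped to [-MAX_DIST, MAX_DIST]."""
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).clamp(-MAX_DIST, MAX_DIST)

head_dim = 64
shared_rel_emb = nn.Embedding(NUM_OFFSETS, head_dim)   # one table for all heads

rel = clipped_relative_positions(seq_len=10)           # (10, 10)
rel_vectors = shared_rel_emb(rel + MAX_DIST)           # (10, 10, head_dim)
# Every head looks up the same rel_vectors, so position information costs only
# NUM_OFFSETS * head_dim parameters regardless of the number of heads.
```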
“…For any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}. According to recent progress (Huang et al., 2020), learnable PE and relative position embeddings can help to further improve BERT's performance. Therefore, in the refined BERT model, we use learnable PE and relative position representations.…”
Section: Embedding Module
Citation type: mentioning
confidence: 99%
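For reference, the linear-function property quoted above follows from the angle-addition identities of the standard sinusoidal encoding; a short LaTeX sketch of the per-frequency rotation is:

```latex
% Per 2-dimensional frequency block of the sinusoidal encoding, with
% \omega_i = 10000^{-2i/d_{\mathrm{model}}}: the matrix depends only on the
% offset k, not on pos, so PE_{pos+k} is a fixed linear transform of PE_{pos}.
\[
\begin{pmatrix} \sin\bigl(\omega_i (pos+k)\bigr) \\ \cos\bigl(\omega_i (pos+k)\bigr) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
\]
```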