Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)
DOI: 10.18653/v1/d18-1475

Modeling Localness for Self-Attention Networks

Abstract: Self-attention networks have proven to be of profound value for their strength in capturing global dependencies. In this work, we propose to model localness for self-attention networks, which enhances their ability to capture useful local context. We cast localness modeling as a learnable Gaussian bias, which indicates the center and scope of the local region to be paid more attention. The bias is then incorporated into the original attention distribution to form a revised distribution. To maintain the strength…
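The core idea in the abstract, a learnable Gaussian bias added to the attention logits before the softmax, can be sketched roughly as follows. This is a minimal illustration rather than the paper's released code: the function name, tensor shapes, and the `center`/`window` arguments (standing in for the predicted central position and window size of each query's local region) are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def local_self_attention(q, k, v, center, window):
    """Scaled dot-product attention with an additive Gaussian localness bias.

    q, k, v: [batch, length, d_k]
    center:  [batch, length]  predicted central position P_i of the local region
    window:  [batch, length]  predicted window size D_i (> 0) of the local region
    """
    d_k = q.size(-1)
    logits = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5           # [B, L, L]

    # Gaussian bias G_ij = -(j - P_i)^2 / (2 * sigma_i^2), with sigma_i = D_i / 2,
    # favouring keys j close to the predicted centre P_i of query i.
    positions = torch.arange(k.size(1), device=q.device, dtype=q.dtype)  # [L]
    sigma = window / 2.0
    bias = -((positions.view(1, 1, -1) - center.unsqueeze(-1)) ** 2) / (
        2.0 * sigma.unsqueeze(-1) ** 2
    )                                                                     # [B, L, L]

    # The bias is added to the logits, yielding the revised attention distribution.
    weights = F.softmax(logits + bias, dim=-1)
    return torch.matmul(weights, v)
```

In the paper the center and scope of the local region are themselves learned from the hidden representations; here they are passed in directly to keep the sketch short.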

Cited by 177 publications (120 citation statements)
References 16 publications
“…We refer readers to Appendix A.1 for the details of our data and experimental settings. Prior studies reveal that modeling locality in lower layers can achieve better performance (Shen et al, 2018; Yu et al, 2018; Yang et al, 2018). Therefore, we merely apply the locality model at the lowest two layers of the encoder.…”
Section: Methods (mentioning)
confidence: 99%
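For illustration only, applying such a locality bias in just the lowest two encoder layers, as the quoted passage describes, might look like the toy sketch below. The class name, the `local_layers` argument, and the use of `nn.TransformerEncoderLayer` with an additive float attention mask are assumptions for this example, not the cited papers' implementations.

```python
import torch
import torch.nn as nn

class LowLayerLocalityEncoder(nn.Module):
    """Toy encoder: a localness bias (supplied as an additive attention mask)
    is used in the lowest `local_layers` layers only; upper layers attend globally."""

    def __init__(self, num_layers=6, d_model=512, nhead=8, local_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.local_layers = local_layers

    def forward(self, x, local_bias=None):
        # `local_bias` is a float [L, L] matrix added to the attention logits
        # (e.g. a Gaussian localness bias); layers above `local_layers` ignore it.
        for i, layer in enumerate(self.layers):
            mask = local_bias if (i < self.local_layers and local_bias is not None) else None
            x = layer(x, src_mask=mask)
        return x
```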
“…Previous work has shown that modeling locality benefits SANs for certain tasks. Luong et al (2015) proposed a Gaussian-based local attention with a predictable position; Sperber et al (2018), in contrast, applied a local method with a variable window size to an acoustic task; Yang et al (2018) investigated the effect of a dynamic local Gaussian bias by combining these two approaches for the translation task. Unlike these methods, which use a learnable local scope, Yang et al (2019b) and Wu et al (2019) restricted the attention area to a fixed size by borrowing the concept of convolution into SANs.…”
Section: Related Work (mentioning)
confidence: 99%
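The fixed-size restriction mentioned at the end of this quote can be expressed as a hard window mask over the attention logits, in contrast to the soft, learnable Gaussian scope. A minimal sketch, with the function name and `window` parameter chosen for illustration:

```python
import torch

def fixed_window_mask(length, window):
    """Boolean [length, length] mask that keeps only keys within `window`
    positions of each query, i.e. a hard local attention window."""
    idx = torch.arange(length)
    return (idx.view(-1, 1) - idx.view(1, -1)).abs() <= window
```

Positions outside the window would then have their logits set to -inf (or a large negative value) before the softmax.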
“…Model | BLEU EN-DE | BLEU EN-FR
Vaswani et al (2017) Transformer big | 28.40 | 41.00
Transformer big + sequence-loss | 28.75 | 41.47
Yang et al (2018) Transformer big + localness | 28.89 | n/a
this work…”
Section: Architecture (mentioning)
confidence: 91%
“…3 Related Work. Attention Mechanism: Attention was first introduced for machine translation tasks by [2], and it has already become an essential part of different architectures [7,13,26], though they may take different forms. Many works try to modify the attention part for different purposes [3,14,16,22,23,25,29]. Our work is mainly related to work that tries to improve the multi-head attention mechanism in the Transformer model.…”
Section: Introduction (mentioning)
confidence: 99%