Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)
DOI: 10.18653/v1/d18-1475

Modeling Localness for Self-Attention Networks

Abstract: Self-attention networks have proven to be of profound value for their strength in capturing global dependencies. In this work, we propose to model localness for self-attention networks, which enhances their ability to capture useful local context. We cast localness modeling as a learnable Gaussian bias, which indicates the center and scope of the local region to be paid more attention. The bias is then incorporated into the original attention distribution to form a revised distribution. To maintain the strength…
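The core idea in the abstract, a learnable Gaussian bias added to the attention logits before the softmax, can be sketched roughly as follows. This is a minimal illustration rather than the paper's released code: the function name, tensor shapes, and the `center`/`window` arguments (standing in for the predicted central position and window size of each query's local region) are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def local_self_attention(q, k, v, center, window):
    """Scaled dot-product attention with an additive Gaussian localness bias.

    q, k, v: [batch, length, d_k]
    center:  [batch, length]  predicted central position P_i of the local region
    window:  [batch, length]  predicted window size D_i (> 0) of the local region
    """
    d_k = q.size(-1)
    logits = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5           # [B, L, L]

    # Gaussian bias G_ij = -(j - P_i)^2 / (2 * sigma_i^2), with sigma_i = D_i / 2,
    # favouring keys j close to the predicted centre P_i of query i.
    positions = torch.arange(k.size(1), device=q.device, dtype=q.dtype)  # [L]
    sigma = window / 2.0
    bias = -((positions.view(1, 1, -1) - center.unsqueeze(-1)) ** 2) / (
        2.0 * sigma.unsqueeze(-1) ** 2
    )                                                                     # [B, L, L]

    # The bias is added to the logits, yielding the revised attention distribution.
    weights = F.softmax(logits + bias, dim=-1)
    return torch.matmul(weights, v)
```

In the paper the center and scope of the local region are themselves learned from the hidden representations; here they are passed in directly to keep the sketch short.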

Cited by 177 publications (120 citation statements)
References 16 publications
“…We refer readers to Appendix A.1 for the details of our data and experimental settings. Prior studies reveal that modeling locality in lower layers can achieve better performance (Shen et al, 2018; Yu et al, 2018; Yang et al, 2018). Therefore, we merely apply the locality model at the lowest two layers of the encoder.…”
Section: Methods (mentioning)
confidence: 99%
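For illustration only, applying such a locality bias in just the lowest two encoder layers, as the quoted passage describes, might look like the toy sketch below. The class name, the `local_layers` argument, and the use of `nn.TransformerEncoderLayer` with an additive float attention mask are assumptions for this example, not the cited papers' implementations.

```python
import torch
import torch.nn as nn

class LowLayerLocalityEncoder(nn.Module):
    """Toy encoder: a localness bias (supplied as an additive attention mask)
    is used in the lowest `local_layers` layers only; upper layers attend globally."""

    def __init__(self, num_layers=6, d_model=512, nhead=8, local_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.local_layers = local_layers

    def forward(self, x, local_bias=None):
        # `local_bias` is a float [L, L] matrix added to the attention logits
        # (e.g. a Gaussian localness bias); layers above `local_layers` ignore it.
        for i, layer in enumerate(self.layers):
            mask = local_bias if (i < self.local_layers and local_bias is not None) else None
            x = layer(x, src_mask=mask)
        return x
```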
“…Previous work has shown that modeling locality benefits SANs for certain tasks. Luong et al (2015) proposed a Gaussian-based local attention with a predictable position; Sperber et al (2018), in contrast, applied a local method with a variable window size to an acoustic task; Yang et al (2018) investigated the effect of a dynamic local Gaussian bias by combining these two approaches for the translation task. Unlike these methods, which use a learnable local scope, Yang et al (2019b) and Wu et al (2019) restricted the attention area to a fixed size by borrowing the concept of convolution into SANs.…”
Section: Related Work (mentioning)
confidence: 99%
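The fixed-size restriction mentioned at the end of this quote can be expressed as a hard window mask over the attention logits, in contrast to the soft, learnable Gaussian scope. A minimal sketch, with the function name and `window` parameter chosen for illustration:

```python
import torch

def fixed_window_mask(length, window):
    """Boolean [length, length] mask that keeps only keys within `window`
    positions of each query, i.e. a hard local attention window."""
    idx = torch.arange(length)
    return (idx.view(-1, 1) - idx.view(1, -1)).abs() <= window
```

Positions outside the window would then have their logits set to -inf (or a large negative value) before the softmax.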
“…Model | BLEU EN-DE | BLEU EN-FR
Vaswani et al (2017) Transformer big | 28.40 | 41.00
Transformer big + sequence-loss | 28.75 | 41.47
Yang et al (2018) Transformer big + localness | 28.89 | n/a
this work…”
Section: Architecture (mentioning)
confidence: 91%
“…3 Related Work. Attention Mechanism: Attention was first introduced for machine translation tasks by [2], and it has already become an essential part of different architectures [7,13,26], though they may take different forms. Many works try to modify the attention part for different purposes [3,14,16,22,23,25,29]. Our work is mainly related to work that tries to improve the multi-head attention mechanism in the Transformer model.…”
Section: Introduction (mentioning)
confidence: 99%