Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1331
Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation

Abstract: Most Neural Machine Translation (NMT) models are based on the sequence-to-sequence (Seq2Seq) model, an encoder-decoder framework equipped with an attention mechanism. However, the conventional attention mechanism treats the decoding at each time step equally, using the same matrix, which is problematic because the softness of the attention should differ for different types of words (e.g. content words and function words). Therefore, we propose a new model with a mechanism called Self-Adaptive Control of …

Cited by 16 publications (13 citation statements)
References 11 publications
“…The proper "softness" of the distribution could depend not only on the task but also on the query. Lin et al. [44] defined a model whose distribution is controlled by a learnable, adaptive temperature parameter. When a "softer" attention is required, the temperature increases, producing a smoother distribution of weights.…”
Section: Distribution Functions
confidence: 99%
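The statement above describes how a temperature parameter controls the softness of an attention distribution. A minimal sketch of temperature-scaled softmax (an illustration of the general mechanism, not the paper's exact model; the function name and example scores are my own):

```python
import numpy as np

def attention_weights(scores, temperature=1.0):
    """Softmax over attention scores with a temperature parameter.

    A higher temperature yields a smoother (softer) distribution;
    a lower temperature concentrates mass on the top-scoring position.
    """
    scaled = scores / temperature
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])
soft = attention_weights(scores, temperature=5.0)   # smoother, closer to uniform
sharp = attention_weights(scores, temperature=0.2)  # sharply peaked on the top score
```

With temperature 5.0 the weights stay near uniform, while at 0.2 nearly all mass collapses onto the highest-scoring position, which is exactly the "concentrate or divert" behavior the title refers to.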
“…In NLP, this is often achieved via neural attention (Bahdanau et al., 2015; Chen et al., 2015; Rush et al., 2015; Cheng et al., 2016; Parikh et al., 2016; Xie et al., 2017). Many variants of attention, such as temperature-controlled attention (Lin et al., 2018) and sparsemax (Martins and Astudillo, 2016), have been proposed to increase sparsity within the attention weights. However, it is still debatable whether attention scores are truly explanations (Wiegreffe and Pinter, 2019).…”
Section: Related Work
confidence: 99%
“…Instead, we only add a learnable scalar parameter and observe that normalizing the weights actually harms performance. Lin et al. (2018) introduced a self-adaptive temperature. However, they focused on parametrizing the temperature of timestep t using the activations from timestep t−1.…”
Section: Data
confidence: 99%
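The last excerpt notes that the temperature at step t is parametrized from the previous timestep's activations. A hedged sketch of that idea (the projection vector `w_tau`, the bound `beta`, and the bounded transform are illustrative assumptions, not necessarily the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical learned projection used only for this illustration.
w_tau = rng.normal(size=dim) / np.sqrt(dim)

def adaptive_temperature(prev_state, beta=2.0):
    """Map the t-1 decoder state to a positive temperature.

    Sketch: tau = beta ** tanh(w_tau . h_{t-1}), so tau is bounded
    in [1/beta, beta] -- soft when tanh is positive, sharp when negative.
    """
    return beta ** np.tanh(w_tau @ prev_state)

def attention(scores, tau):
    z = scores / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

h_prev = rng.normal(size=dim)  # stand-in for the step t-1 decoder state
tau = adaptive_temperature(h_prev)
weights = attention(np.array([2.0, 1.0, 0.5]), tau)
```

The bounded transform keeps the temperature in a fixed positive range, so the model can only vary the softness of attention between a "most concentrated" and "most diverted" extreme rather than collapsing or flattening the distribution arbitrarily.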