Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1331
Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation

Abstract: Most Neural Machine Translation (NMT) models are based on the sequence-to-sequence (Seq2Seq) model, an encoder-decoder framework equipped with an attention mechanism. However, the conventional attention mechanism treats the decoding at each time step equally, using the same matrix, which is problematic because the softness of the attention should differ for different types of words (e.g. content words and function words). Therefore, we propose a new model with a mechanism called Self-Adaptive Control of …

Cited by 16 publications (13 citation statements)
References 11 publications
“…The proper "softness" of the distribution could depend not only on the task but also on the query. Lin et al. [44] defined a model whose distribution is controlled by a learnable, adaptive temperature parameter. When a "softer" attention is required, the temperature increases, producing a smoother distribution of weights.…”
Section: Distribution Functions
confidence: 99%
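The statement above describes how a temperature parameter controls the softness of an attention distribution. A minimal sketch of temperature-scaled softmax (an illustration of the general mechanism, not the paper's exact model; the function name and example scores are my own):

```python
import numpy as np

def attention_weights(scores, temperature=1.0):
    """Softmax over attention scores with a temperature parameter.

    A higher temperature yields a smoother (softer) distribution;
    a lower temperature concentrates mass on the top-scoring position.
    """
    scaled = scores / temperature
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])
soft = attention_weights(scores, temperature=5.0)   # smoother, closer to uniform
sharp = attention_weights(scores, temperature=0.2)  # sharply peaked on the top score
```

With temperature 5.0 the weights stay near uniform, while at 0.2 nearly all mass collapses onto the highest-scoring position, which is exactly the "concentrate or divert" behavior the title refers to.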
“…In NLP, this is often achieved via neural attention (Bahdanau et al., 2015; Chen et al., 2015; Rush et al., 2015; Cheng et al., 2016; Parikh et al., 2016; Xie et al., 2017). Many variants of attention, such as temperature-controlled attention (Lin et al., 2018) and sparsemax (Martins and Astudillo, 2016), have been proposed to increase sparsity within the attention weights. However, it is still debatable whether attention scores are truly explanations (Wiegreffe and Pinter, 2019).…”
Section: Related Work
confidence: 99%
“…Instead, we only add a learnable scalar parameter and observe that normalizing the weights actually harms performance. Lin et al. (2018) introduced a self-adaptive temperature. However, they focused on parametrizing the temperature of timestep t using the activations from timestep t−1.…”
Section: Data
confidence: 99%
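The last excerpt notes that the temperature at step t is parametrized from the previous timestep's activations. A hedged sketch of that idea (the projection vector `w_tau`, the bound `beta`, and the bounded transform are illustrative assumptions, not necessarily the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical learned projection used only for this illustration.
w_tau = rng.normal(size=dim) / np.sqrt(dim)

def adaptive_temperature(prev_state, beta=2.0):
    """Map the t-1 decoder state to a positive temperature.

    Sketch: tau = beta ** tanh(w_tau . h_{t-1}), so tau is bounded
    in [1/beta, beta] -- soft when tanh is positive, sharp when negative.
    """
    return beta ** np.tanh(w_tau @ prev_state)

def attention(scores, tau):
    z = scores / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

h_prev = rng.normal(size=dim)  # stand-in for the step t-1 decoder state
tau = adaptive_temperature(h_prev)
weights = attention(np.array([2.0, 1.0, 0.5]), tau)
```

The bounded transform keeps the temperature in a fixed positive range, so the model can only vary the softness of attention between a "most concentrated" and "most diverted" extreme rather than collapsing or flattening the distribution arbitrarily.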