Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1453
Jointly Learning to Align and Translate with Transformer Models

Abstract: The state of the art in machine translation (MT) is governed by neural approaches, which typically provide superior translation accuracy over statistical approaches. However, on the closely related task of word alignment, traditional statistical word alignment models often remain the go-to solution. In this paper, we present an approach to train a Transformer model to produce both accurate translations and alignments. We extract discrete alignments from the attention probabilities learnt during regular neural …
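The extraction step sketched in the abstract, reading discrete word alignments off soft attention probabilities, can be illustrated briefly: average the cross-attention of one decoder layer over heads, then take an argmax per target token. The following Python sketch is illustrative only, not the paper's released code; the array shape and function name are assumptions.

import numpy as np

def extract_alignments(attn):
    """Read discrete word alignments off soft attention probabilities.

    attn: hypothetical array of shape (num_heads, num_tgt, num_src) holding
          the cross-attention probabilities of a single decoder layer.
    Returns a set of (src_index, tgt_index) links.
    """
    # Average the attention distributions over heads.
    avg = attn.mean(axis=0)  # shape (num_tgt, num_src)
    # Align each target token to the source token receiving the most attention mass.
    return {(int(np.argmax(avg[t])), t) for t in range(avg.shape[0])}

# Toy usage: 2 heads, 3 target tokens, 4 source tokens.
attn = np.random.dirichlet(np.ones(4), size=(2, 3))
print(extract_alignments(attn))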

Cited by 109 publications (147 citation statements). References 24 publications (31 reference statements).
“…We explore this hypothesis on the widely used Gold Alignment dataset 3 and follow Tang et al (2019) to perform the alignment. The only difference being that we average the attention matrices across all heads from the penultimate layer (Garg et al, 2019). The alignment error rate (AER, Och and Ney 2003), precision (P) and recall (R) are reported as the evaluation metrics.…”
Section: Alignment Quality
Citation type: mentioning, confidence: 99%
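For reference, the metrics named in this excerpt follow Och and Ney (2003): precision is measured against the possible links P, recall against the sure links S, and AER combines both. Below is a minimal sketch assuming alignments are given as sets of (source, target) index pairs; the function and variable names are illustrative, not taken from any cited implementation.

def alignment_scores(hyp, sure, possible):
    # hyp:      set of hypothesised (src, tgt) links A
    # sure:     set of gold sure links S
    # possible: set of gold possible links P (S is a subset of P)
    a_and_p = len(hyp & possible)
    a_and_s = len(hyp & sure)
    precision = a_and_p / len(hyp) if hyp else 0.0
    recall = a_and_s / len(sure) if sure else 0.0
    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    aer = 1.0 - (a_and_s + a_and_p) / (len(hyp) + len(sure))
    return precision, recall, aer

# Toy example: two hypothesised links, one matching a sure link.
print(alignment_scores(hyp={(0, 0), (1, 2)},
                       sure={(0, 0)},
                       possible={(0, 0), (1, 1)}))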
“…A closely related research area attempts to guide the attention mechanism, e.g. by incorporating alignment objectives (Garg et al, 2019), or improving the representation through external information such as syntactic supervision (Pham et al, 2019; Currey and Heafield, 2019; Deguchi et al, 2019). The third line of research argues that Transformer networks are over-parametrized and learn redundant information that can be pruned in various ways (Sanh et al, 2019).…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…The attention mechanism in NMT does not functionally play the role of word alignments between the source and the target, at least not in the same way as its analog in SMT. It is hard to interpret the attention activations and extract meaningful word alignments especially from Transformer (Garg et al, 2019). As a result, the most widely used word alignment tools are still external statistical models such as FAST-ALIGN (Dyer et al, 2013) and GIZA++ (Brown et al, 1993; Och and Ney, 2003).…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…1. However, such schedule only captures noisy word alignments (Ding et al, 2019; Garg et al, 2019). One of the major problems is that it induces alignment before observing the to-be-aligned target token (Peter et al, 2017; Ding et al, 2019).…”
Section: Introduction
Citation type: mentioning, confidence: 99%