Proceedings of the Third Conference on Machine Translation: Research Papers 2018
DOI: 10.18653/v1/w18-6318
On The Alignment Problem In Multi-Head Attention-Based Neural Machine Translation

Abstract: This work investigates the alignment problem in state-of-the-art multi-head attention models based on the transformer architecture. We demonstrate that alignment extraction in transformer models can be improved by augmenting the multi-head source-to-target attention component with an additional alignment head, which is used to compute sharper attention weights. We describe how to use the alignment head to achieve competitive performance. To study the effect of adding the alignment head, we simulate a dictionary-gu…
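The abstract's core idea, an extra attention head dedicated to alignment within the source-to-target attention, can be pictured with a minimal sketch. The class below is an illustration only: the layer names, dimensions, and the temperature knob used to sharpen the attention weights are assumptions of this sketch, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

class AlignmentHead(torch.nn.Module):
    """Extra source-to-target attention head used for alignment extraction,
    kept alongside the regular multi-head attention of a transformer decoder."""

    def __init__(self, d_model, temperature=0.5):
        super().__init__()
        self.q = torch.nn.Linear(d_model, d_model)  # projects decoder states
        self.k = torch.nn.Linear(d_model, d_model)  # projects encoder states
        self.temperature = temperature              # < 1.0 sharpens the softmax

    def forward(self, decoder_states, encoder_states):
        # decoder_states: (tgt_len, d_model), encoder_states: (src_len, d_model)
        scores = self.q(decoder_states) @ self.k(encoder_states).t()
        scores = scores / (decoder_states.size(-1) ** 0.5 * self.temperature)
        weights = F.softmax(scores, dim=-1)   # (tgt_len, src_len) attention matrix
        alignment = weights.argmax(dim=-1)    # hard source position per target token
        return weights, alignment
```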

Cited by 46 publications (53 citation statements)
References 22 publications
“…Word alignments are essential for statistical machine translation and useful in NMT, e.g., for imposing priors on attention matrices (Liu et al., 2016; Chen et al., 2016; Alkhouli and Ney, 2017; Alkhouli et al., 2018) or for decoding (Alkhouli et al., 2016; Press and Smith, 2018). Further, word alignments have been successfully used in a range of tasks such as typological analysis (Lewis and Xia, 2008; Östling, 2015b), annotation projection (Yarowsky et al., 2001; Padó and Lapata, 2009; Asgari and Schütze, 2017; Huck et al., 2019) and creating multilingual embeddings (Guo et al., 2016; Ammar et al., 2016; Dufter et al., 2018). Statistical word aligners such as the IBM models (Brown et al., 1993) and their implementations Giza++ (Och and Ney, 2003) and fast-align (Dyer et al., 2013), as well as newer models such as eflomal (Östling and Tiedemann, 2016), are widely used for alignment.…”
Section: Introduction
confidence: 99%
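As a concrete illustration of "imposing priors on attention matrices", one common recipe is a guided-alignment loss that pushes the decoder's attention toward alignments produced by an external aligner such as Giza++ or fast-align. The sketch below is a hedged PyTorch example; the function name, the cross-entropy formulation, and the weighting term are assumptions chosen for illustration rather than a reproduction of any single cited method.

```python
import torch

def guided_alignment_loss(attention, alignment, eps=1e-8):
    """Cross-entropy between an attention matrix and an external 0/1 word alignment.

    attention: (tgt_len, src_len) attention weights, each row sums to 1.
    alignment: (tgt_len, src_len) binary matrix from an external word aligner.
    """
    row_sums = alignment.sum(dim=-1, keepdim=True)
    aligned = row_sums.squeeze(-1) > 0                  # skip unaligned target tokens
    target_dist = alignment / row_sums.clamp_min(eps)   # normalize rows to distributions
    ce = -(target_dist * (attention + eps).log()).sum(dim=-1)
    return ce[aligned].mean() if aligned.any() else attention.new_zeros(())

# During training this term would be added to the usual translation loss,
# e.g. loss = nmt_loss + lam * guided_alignment_loss(attn, align),
# where lam is a hyperparameter of this sketch.
```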
“…Similar to MTL-FULLC (Garg et al., 2019), BAO-GUIDED adapts the alignment induction to the to-be-aligned target token by requiring the full target sentence as input. Therefore, BAO-GUIDED is not applicable in cases where alignments are incrementally computed during the decoding process, e.g., dictionary-guided decoding (Alkhouli et al., 2018). In contrast, SHIFT-AET performs quite well in such cases (Section 4.3).…”
Section: Alignment Results
confidence: 99%
“…In addition to AER, we compare the performance of NAIVE-ATT, SHIFT-ATT and SHIFT-AET on dictionary-guided machine translation (Song et al., 2020), which is an alignment-based downstream task. Given source and target constraint pairs from a dictionary, the NMT model is encouraged to translate with the provided constraints via word alignments (Alkhouli et al., 2018; Hasler et al., 2018; Hokamp and Liu, 2017; Song et al., 2020). More specifically, at each decoding step, the last token of the candidate translation is revised with the target constraint if it is aligned to the corresponding source constraint according to the alignment induction method.…”
Section: Downstream Task Results
confidence: 99%
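The revision step described in this quotation can be made concrete with a small sketch. Everything here, the function name, the representation of constraints as a map from source positions to target tokens, and the use of an argmax over an alignment head's attention row, is a hypothetical illustration of dictionary-guided decoding, not code from the cited papers.

```python
def apply_dictionary_constraint(hypothesis, attention_row, constraints):
    """Revise the last token of a partial translation with a dictionary constraint.

    hypothesis:    list of target tokens generated so far.
    attention_row: alignment weights over source positions for the last token
                   (e.g. one row of an alignment head's attention matrix).
    constraints:   dict mapping a source position to the required target token
                   (hypothetical format chosen for this sketch).
    """
    if not hypothesis:
        return hypothesis
    aligned_src = max(range(len(attention_row)), key=attention_row.__getitem__)
    if aligned_src in constraints:
        hypothesis[-1] = constraints[aligned_src]  # enforce the dictionary entry
    return hypothesis
```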
“…Deriving alignments is known to be more challenging for transformer networks with self-attention and multiple attention heads. There has been some recent work on alleviating this issue by explicitly adding an alignment head to the base architecture [15].…”
Section: Choice Of NMT Architecture
confidence: 99%