2019
DOI: 10.48550/arxiv.1909.02074
Preprint
Jointly Learning to Align and Translate with Transformer Models

Cited by 4 publications (14 citation statements)
References 0 publications
“…However, the probability distribution given by the attention mechanism may not necessarily allow for inference on word alignment between vocabularies [10], thus requiring the use of statistical approaches for word alignment [2]. To address this problem, Garg et al. [6] proposed to train a Transformer in a multi-task framework. Their multi-task loss function combines the negative log-likelihood loss of the Transformer (for token prediction based on past tokens) with a conditional cross-entropy loss defined through the Kullback-Leibler (KL) divergence.…”
Section: Neural Machine Translation (mentioning)
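
As a concrete illustration of the multi-task objective this statement describes, here is a minimal PyTorch sketch. The function name `joint_loss`, the tensor layout, and the interpolation weight `lambda_align` are assumptions for illustration, not the authors' code; the alignment term is the conditional cross-entropy between the label alignment distribution G and the attention A of the chosen alignment head, which for fixed G is equivalent to minimizing KL(G || A).

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, targets, align_probs, align_labels, lambda_align=0.05):
    """Multi-task objective: translation NLL plus an alignment loss.

    logits:       (batch, tgt_len, vocab) decoder output scores
    targets:      (batch, tgt_len) gold target token ids
    align_probs:  (batch, tgt_len, src_len) attention A of the alignment head
    align_labels: (batch, tgt_len, src_len) label alignment distribution G
    lambda_align: assumed interpolation hyperparameter
    """
    # Standard NMT term: negative log-likelihood of the next token
    # given past tokens.
    nll = F.cross_entropy(logits.transpose(1, 2), targets)

    # Alignment term: conditional cross-entropy -sum_j G_ij * log A_ij,
    # averaged over target positions. Since G is fixed, minimizing this
    # is equivalent to minimizing KL(G || A) (they differ only by the
    # constant entropy of G).
    eps = 1e-9
    align_ce = -(align_labels * torch.log(align_probs + eps)).sum(-1).mean()

    return nll + lambda_align * align_ce
```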
“…In their approach, they used the attention probabilities from the penultimate attention layer as labels for their supervised algorithm, thus dispensing with the need for annotated word alignments. The divergence between the attention probabilities of one arbitrarily chosen alignment head and the labeled alignment distribution is minimized as a KL-divergence optimization problem [6].…”
Section: Neural Machine Translation (mentioning)
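
To make the labeling step concrete, the sketch below shows one plausible way to derive the soft labels G from the model's own attention, assuming the decoder exposes its per-layer cross-attention weights as a list of tensors. The helper name, the tensor layout, and the averaging over heads are assumptions for illustration, not the authors' released implementation.

```python
import torch

def alignment_labels_from_attention(cross_attn):
    """Derive soft alignment labels from decoder cross-attention.

    cross_attn: list of (batch, n_heads, tgt_len, src_len) tensors,
                one per decoder layer, collected during a forward pass.
    Returns a (batch, tgt_len, src_len) distribution used as labels G.
    """
    # Average the attention heads of the penultimate layer and detach,
    # so the labels act as a fixed supervision signal that gradients
    # do not flow through.
    labels = cross_attn[-2].mean(dim=1).detach()
    # Renormalize over source positions so each target row is a proper
    # probability distribution (a no-op up to numerical error).
    labels = labels / labels.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return labels
```

Detaching the labels mirrors the idea in the statement above: the penultimate-layer attention serves as a fixed target, and only the single chosen alignment head is trained toward it via the KL/cross-entropy term.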