Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.398

Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation

Abstract: Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these mostly suggest there is something fundamentally wrong with NMT as a model or its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode. We argue that the evidence corroborates the inadequacy of MAP decoding more than casts doubt on the model and its training algorithm.
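Stated symbolically, the decision rule in question selects the mode of the model distribution; a minimal formulation for reference (the notation below is assumed for illustration, not quoted from the paper):

    % MAP decoding: return the single highest-scoring translation, i.e. the mode
    y^{\mathrm{MAP}} = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \; p_{\theta}(y \mid x)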

Cited by 48 publications (107 citation statements) | References 39 publications

“…Figure 1 shows that there is no strong correlation between the BLEU score ranking of samples and the log probability score ranking for the majority of source sentences; thus, maximum a posteriori (MAP) decoding is incapable of finding the desired output. In parallel to our study, Eikema and Aziz (2020) also report that the mismatch regarding MLE training of autoregressive models is attributable to the distribution of the probability mass rather than to the parameter estimation, resulting in poor MAP decoding.…”
Section: Introduction (supporting; confidence: 80%)
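A minimal sketch of the kind of analysis behind that observation, assuming a list of sampled translations with their model log-probabilities is already available (the input format is an assumption; sacrebleu and scipy are real libraries used in the obvious way):

    # Sketch: rank-correlate model log-probability with sentence-level BLEU
    # across samples for one source sentence. A correlation near zero means
    # the model's ranking and the quality ranking barely agree.
    from sacrebleu.metrics import BLEU
    from scipy.stats import spearmanr

    def logprob_bleu_correlation(samples, reference):
        """samples: list of (hypothesis, log_prob) pairs; reference: gold string."""
        bleu = BLEU(effective_order=True)  # smoothing for sentence-level scores
        quality = [bleu.sentence_score(hyp, [reference]).score for hyp, _ in samples]
        logps = [lp for _, lp in samples]
        rho, _ = spearmanr(logps, quality)
        return rho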
“…tasks for which there may be more than one correct solution for any given input. As a consequence, probability mass may be spread over an arbitrarily large number of hypotheses (Ott et al., 2018a; Eikema and Aziz, 2020). In contrast, the task of…”
Section: Discussion (mentioning; confidence: 99%)
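The diffuse-mass point can be made concrete with a toy enumerable model; everything below (the vocabulary, the per-position probabilities) is illustrative only, not taken from the cited papers:

    # Sketch: even the single most probable sequence (the mode) can carry a
    # vanishing share of total probability mass when the distribution is diffuse.
    import itertools, math

    VOCAB = "abcd"   # toy 4-symbol vocabulary
    LENGTH = 8       # fixed sequence length, so we can enumerate everything

    def seq_logprob(seq):
        # Nearly uniform per-position model with a tiny bias toward 'a'.
        probs = {"a": 0.28, "b": 0.24, "c": 0.24, "d": 0.24}
        return sum(math.log(probs[ch]) for ch in seq)

    all_seqs = ("".join(t) for t in itertools.product(VOCAB, repeat=LENGTH))
    mode = max(all_seqs, key=seq_logprob)
    print(mode, math.exp(seq_logprob(mode)))  # 'aaaaaaaa' with mass ~3.8e-5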
“…However, text quality almost invariably decreases for beam sizes larger than k = 5. This phenomenon is sometimes referred to as the beam search curse, and has been investigated in detail by a number of scholarly works (Koehn and Knowles, 2017; Murray and Chiang, 2018; Yang et al., 2018; Stahlberg and Byrne, 2019; Cohen and Beck, 2019; Eikema and Aziz, 2020). Exact decoding can be seen as the case of beam search where the beam size is effectively stretched to infinity.…”
Section: Decoding (mentioning; confidence: 99%)
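To make the role of k concrete, here is a minimal, length-unnormalised beam search over a generic scoring callback; score_next is a hypothetical interface (prefix -> token log-probabilities), not code from any cited system:

    # Sketch: standard beam search. Larger k finds higher-probability
    # hypotheses, yet observed quality often drops (the beam search curse);
    # stretching k to infinity recovers exact (mode-finding) decoding.
    def beam_search(score_next, bos, eos, k=5, max_len=50):
        """score_next(prefix) -> dict mapping next token to its log-prob."""
        beams = [([bos], 0.0)]   # (prefix, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, lp in beams:
                for tok, tok_lp in score_next(prefix).items():
                    candidates.append((prefix + [tok], lp + tok_lp))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, lp in candidates[:k]:
                (finished if prefix[-1] == eos else beams).append((prefix, lp))
            if not beams:
                break
        return max(finished + beams, key=lambda c: c[1])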
“…The latter observation reveals a deficiency that is seemingly specific to the transformer architecture, one that may be linked to observations in natural language generation tasks. More specifically, we take this as quantitative evidence for recent qualitative observations that when left to generate lots of text, neural language models based on the transformer architecture tend to babble repetitively (Holtzman et al., 2020; Cohen and Beck, 2019; Eikema and Aziz, 2020).…”
Section: Type-token (mentioning; confidence: 99%)
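The type-token statistic underlying that observation is straightforward to compute; a minimal sketch (whitespace tokenisation is an assumption for illustration):

    # Sketch: type-token ratio (TTR) as a crude repetitiveness signal.
    # Repetitive, babbling output reuses tokens, so its TTR drops toward 0.
    def type_token_ratio(text: str) -> float:
        tokens = text.split()  # assumed whitespace tokenisation
        return len(set(tokens)) / max(len(tokens), 1)

    print(type_token_ratio("the cat sat on the mat"))   # 5 types / 6 tokens ≈ 0.83
    print(type_token_ratio("the the the the the the"))  # 1 type  / 6 tokens ≈ 0.17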