Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.398

Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation

Abstract: Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these mostly suggest there is something fundamentally wrong with NMT as a model or its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode. We argue that the evidence corroborates the inadequacy of MAP decoding more than casts doubt on the model and its training algorithm.
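Stated symbolically, the decision rule in question selects the mode of the model distribution; a minimal formulation for reference (the notation below is assumed for illustration, not quoted from the paper):

    % MAP decoding: return the single highest-scoring translation, i.e. the mode
    y^{\mathrm{MAP}} = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \; p_{\theta}(y \mid x)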

Cited by 48 publications (107 citation statements) | References 39 publications

“…Figure 1 shows that there is no strong correlation between the BLEU score ranking of samples and the log probability score ranking for the majority of source sentences; thus, maximum a posteriori (MAP) decoding is incapable of finding the desired output. In parallel to our study, Eikema and Aziz (2020) also report that the mismatch regarding MLE training of autoregressive models is attributable to the distribution of the probability mass rather than to the parameter estimation, resulting in poor MAP decoding.…”
Section: Introduction (supporting; confidence: 80%)
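A minimal sketch of the kind of analysis behind that observation, assuming a list of sampled translations with their model log-probabilities is already available (the input format is an assumption; sacrebleu and scipy are real libraries used in the obvious way):

    # Sketch: rank-correlate model log-probability with sentence-level BLEU
    # across samples for one source sentence. A correlation near zero means
    # the model's ranking and the quality ranking barely agree.
    from sacrebleu.metrics import BLEU
    from scipy.stats import spearmanr

    def logprob_bleu_correlation(samples, reference):
        """samples: list of (hypothesis, log_prob) pairs; reference: gold string."""
        bleu = BLEU(effective_order=True)  # smoothing for sentence-level scores
        quality = [bleu.sentence_score(hyp, [reference]).score for hyp, _ in samples]
        logps = [lp for _, lp in samples]
        rho, _ = spearmanr(logps, quality)
        return rho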
“…tasks for which there may be more than one correct solution for any given input. As a consequence, probability mass may be spread over an arbitrarily large number of hypotheses (Ott et al., 2018a; Eikema and Aziz, 2020). In contrast, the task of…”
Section: Discussion (mentioning; confidence: 99%)
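The diffuse-mass point can be made concrete with a toy enumerable model; everything below (the vocabulary, the per-position probabilities) is illustrative only, not taken from the cited papers:

    # Sketch: even the single most probable sequence (the mode) can carry a
    # vanishing share of total probability mass when the distribution is diffuse.
    import itertools, math

    VOCAB = "abcd"   # toy 4-symbol vocabulary
    LENGTH = 8       # fixed sequence length, so we can enumerate everything

    def seq_logprob(seq):
        # Nearly uniform per-position model with a tiny bias toward 'a'.
        probs = {"a": 0.28, "b": 0.24, "c": 0.24, "d": 0.24}
        return sum(math.log(probs[ch]) for ch in seq)

    all_seqs = ("".join(t) for t in itertools.product(VOCAB, repeat=LENGTH))
    mode = max(all_seqs, key=seq_logprob)
    print(mode, math.exp(seq_logprob(mode)))  # 'aaaaaaaa' with mass ~3.8e-5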
“…However, text quality almost invariably decreases for beam sizes larger than k = 5. This phenomenon is sometimes referred to as the beam search curse, and has been investigated in detail by a number of scholarly works (Koehn and Knowles, 2017; Murray and Chiang, 2018; Yang et al., 2018; Stahlberg and Byrne, 2019; Cohen and Beck, 2019; Eikema and Aziz, 2020). Exact decoding can be seen as the case of beam search where the beam size is effectively stretched to infinity.…”
Section: Decoding (mentioning; confidence: 99%)
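To make the role of k concrete, here is a minimal, length-unnormalised beam search over a generic scoring callback; score_next is a hypothetical interface (prefix -> token log-probabilities), not code from any cited system:

    # Sketch: standard beam search. Larger k finds higher-probability
    # hypotheses, yet observed quality often drops (the beam search curse);
    # stretching k to infinity recovers exact (mode-finding) decoding.
    def beam_search(score_next, bos, eos, k=5, max_len=50):
        """score_next(prefix) -> dict mapping next token to its log-prob."""
        beams = [([bos], 0.0)]   # (prefix, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, lp in beams:
                for tok, tok_lp in score_next(prefix).items():
                    candidates.append((prefix + [tok], lp + tok_lp))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, lp in candidates[:k]:
                (finished if prefix[-1] == eos else beams).append((prefix, lp))
            if not beams:
                break
        return max(finished + beams, key=lambda c: c[1])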
“…The latter observation reveals a deficiency that is seemingly specific to the transformer architecture, one that may be linked to observations in natural language generation tasks. More specifically, we take this as quantitative evidence for recent qualitative observations that when left to generate lots of text, neural language models based on the transformer architecture tend to babble repetitively (Holtzman et al., 2020; Cohen and Beck, 2019; Eikema and Aziz, 2020).…”
Section: Type-token (mentioning; confidence: 99%)
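The type-token statistic underlying that observation is straightforward to compute; a minimal sketch (whitespace tokenisation is an assumption for illustration):

    # Sketch: type-token ratio (TTR) as a crude repetitiveness signal.
    # Repetitive, babbling output reuses tokens, so its TTR drops toward 0.
    def type_token_ratio(text: str) -> float:
        tokens = text.split()  # assumed whitespace tokenisation
        return len(set(tokens)) / max(len(tokens), 1)

    print(type_token_ratio("the cat sat on the mat"))   # 5 types / 6 tokens ≈ 0.83
    print(type_token_ratio("the the the the the the"))  # 1 type  / 6 tokens ≈ 0.17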