Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1513

Automatic Reference-Based Evaluation of Pronoun Translation Misses the Point

Abstract: We compare the performance of the APT and AutoPRF metrics for pronoun translation against a manually annotated dataset comprising human judgements as to the correctness of translations of the PROTEST test suite. Although there is some correlation with the human judgements, a range of issues limit the performance of the automated metrics. Instead, we recommend the use of semi-automatic metrics and test suites in place of fully automatic metrics.
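As a concrete illustration of this kind of meta-evaluation, the sketch below compares an automatic pronoun metric's per-example scores against binary human judgements. It is not the paper's actual protocol; the data format, field names, and toy examples are invented purely for illustration.

```python
# Minimal sketch of meta-evaluating an automatic pronoun metric against human
# judgements. Hypothetical data: each example pairs a metric decision
# (1.0 = metric accepts the translated pronoun) with a human judgement
# (1 = translation judged correct by the annotator).
from statistics import correlation  # Python 3.10+

examples = [
    {"metric_score": 1.0, "human_correct": 1},
    {"metric_score": 0.0, "human_correct": 1},  # metric penalises a valid alternative pronoun
    {"metric_score": 1.0, "human_correct": 0},  # metric rewards a pronoun the human rejects
    {"metric_score": 0.0, "human_correct": 0},
]

metric = [e["metric_score"] for e in examples]
human = [e["human_correct"] for e in examples]

# Simple agreement rate plus Pearson correlation between metric and humans.
agreement = sum(m == h for m, h in zip(metric, human)) / len(examples)
print(f"agreement with human judgements: {agreement:.2f}")
print(f"correlation with human judgements: {correlation(metric, human):.2f}")
```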

Cited by 33 publications (10 citation statements)
References 15 publications
“…An alternative approach to automatically evaluating pronoun translation is reference-based methods that produce a score based on word alignment between source, translation output, and reference translation, and on the identification of pronouns in them, such as AutoPRF (Hardmeier and Federico, 2010) and APT (Miculicich Werlen and Popescu-Belis, 2017). Guillou and Hardmeier (2018) perform a human meta-evaluation and show substantial disagreement between reference-based metrics and human judges, especially because there often exist valid alternative translations that use different pronouns than the reference. Our test set, and our protocol for generating contrastive examples, is focused on selected pronouns to minimize the risk of producing contrastive examples that are actually valid translations.…”
Section: Evaluation of Pronoun Translation (mentioning)
confidence: 99%
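For readers unfamiliar with these metrics, the following is a minimal, illustrative sketch of a reference-based pronoun score of the kind described above: source pronouns are located, word alignments are followed into the MT output and the reference, and the aligned tokens are compared. It is not the official AutoPRF or APT implementation; the token lists, alignments, and pronoun inventory are assumptions made purely for illustration.

```python
# Illustrative, simplified reference-based pronoun score (NOT the official
# AutoPRF or APT code): follow word alignments from source pronouns into the
# MT output and the reference, and count how often the aligned tokens agree.
EN_PRONOUNS = {"it", "they", "he", "she"}  # illustrative subset

def pronoun_match_score(source, output, reference, src_out_align, src_ref_align):
    """Alignments are dicts mapping a source token index to a target token index."""
    matches, total = 0, 0
    for i, tok in enumerate(source):
        if tok.lower() not in EN_PRONOUNS:
            continue
        total += 1
        out_tok = output[src_out_align[i]] if i in src_out_align else None
        ref_tok = reference[src_ref_align[i]] if i in src_ref_align else None
        if out_tok is not None and ref_tok is not None and out_tok.lower() == ref_tok.lower():
            matches += 1
    return matches / total if total else 0.0

# The output's "il" disagrees with the reference's "elle", so the metric scores 0.0,
# even though a human judge might accept the translation if the antecedent differs.
src, out, ref = ["it", "works"], ["il", "fonctionne"], ["elle", "fonctionne"]
print(pronoun_match_score(src, out, ref, {0: 0, 1: 1}, {0: 0, 1: 1}))  # -> 0.0
```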
“…Most of these perform evaluation based on a reference and do not take the context into account. There are also those that suggest using evaluation test sets or, better yet, combining them with semi-automatic evaluation schemes [24]. More recently, Stojanovski and Fraser [103] propose to use oracle experiments for evaluating the effect of pronoun resolution and coherence in MT.…”
Section: Discussion (mentioning)
confidence: 99%
“…Jwalapuram et al (2019) propose a model for pronoun translation evaluation trained on pairs of sentences consisting of the reference and a system output with differing pronouns. However, as Guillou and Hardmeier (2018) point out, this fails to take into account that often there is not a 1:1 correspondence between pronouns in different languages. As a result, a system translation may be correct despite not containing the exact pronoun in the reference, and incorrect even if containing the pronoun in the reference, because of differences in the translation of the referent.…”
Section: Coreference Resolution in Machine Translation (mentioning)
confidence: 99%
“…Alternatives to BLEU include F1, partial credit, and oracle-guided approaches (Hardmeier and Federico, 2010; Guillou and Hardmeier, 2016; Miculicich Werlen and Popescu-Belis, 2017). However, Guillou and Hardmeier (2018) show that these metrics can miss important cases and propose semi-automatic evaluation. In contrast, our evaluation is completely automatic.…”
Section: Coreference Resolution in Machine Translation (mentioning)
confidence: 99%