Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1513

Automatic Reference-Based Evaluation of Pronoun Translation Misses the Point

Abstract: We compare the performance of the APT and AutoPRF metrics for pronoun translation against a manually annotated dataset comprising human judgements as to the correctness of translations of the PROTEST test suite. Although there is some correlation with the human judgements, a range of issues limit the performance of the automated metrics. Instead, we recommend the use of semi-automatic metrics and test suites in place of fully automatic metrics.
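As a concrete illustration of this kind of meta-evaluation, the sketch below compares an automatic pronoun metric's per-example scores against binary human judgements. It is not the paper's actual protocol; the data format, field names, and toy examples are invented purely for illustration.

```python
# Minimal sketch of meta-evaluating an automatic pronoun metric against human
# judgements. Hypothetical data: each example pairs a metric decision
# (1.0 = metric accepts the translated pronoun) with a human judgement
# (1 = translation judged correct by the annotator).
from statistics import correlation  # Python 3.10+

examples = [
    {"metric_score": 1.0, "human_correct": 1},
    {"metric_score": 0.0, "human_correct": 1},  # metric penalises a valid alternative pronoun
    {"metric_score": 1.0, "human_correct": 0},  # metric rewards a pronoun the human rejects
    {"metric_score": 0.0, "human_correct": 0},
]

metric = [e["metric_score"] for e in examples]
human = [e["human_correct"] for e in examples]

# Simple agreement rate plus Pearson correlation between metric and humans.
agreement = sum(m == h for m, h in zip(metric, human)) / len(examples)
print(f"agreement with human judgements: {agreement:.2f}")
print(f"correlation with human judgements: {correlation(metric, human):.2f}")
```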

Cited by 33 publications (10 citation statements)
References 15 publications
“…An alternative approach to automatically evaluating pronoun translation is reference-based methods that produce a score based on word alignment between source, translation output, and reference translation, and on the identification of pronouns in them, such as AutoPRF (Hardmeier and Federico, 2010) and APT (Miculicich Werlen and Popescu-Belis, 2017). Guillou and Hardmeier (2018) perform a human meta-evaluation and show substantial disagreement between reference-based metrics and human judges, especially because there often exist valid alternative translations that use different pronouns than the reference. Our test set, and our protocol for generating contrastive examples, is focused on selected pronouns to minimize the risk of producing contrastive examples that are actually valid translations.…”
Section: Evaluation of Pronoun Translation (mentioning)
confidence: 99%
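For readers unfamiliar with these metrics, the following is a minimal, illustrative sketch of a reference-based pronoun score of the kind described above: source pronouns are located, word alignments are followed into the MT output and the reference, and the aligned tokens are compared. It is not the official AutoPRF or APT implementation; the token lists, alignments, and pronoun inventory are assumptions made purely for illustration.

```python
# Illustrative, simplified reference-based pronoun score (NOT the official
# AutoPRF or APT code): follow word alignments from source pronouns into the
# MT output and the reference, and count how often the aligned tokens agree.
EN_PRONOUNS = {"it", "they", "he", "she"}  # illustrative subset

def pronoun_match_score(source, output, reference, src_out_align, src_ref_align):
    """Alignments are dicts mapping a source token index to a target token index."""
    matches, total = 0, 0
    for i, tok in enumerate(source):
        if tok.lower() not in EN_PRONOUNS:
            continue
        total += 1
        out_tok = output[src_out_align[i]] if i in src_out_align else None
        ref_tok = reference[src_ref_align[i]] if i in src_ref_align else None
        if out_tok is not None and ref_tok is not None and out_tok.lower() == ref_tok.lower():
            matches += 1
    return matches / total if total else 0.0

# The output's "il" disagrees with the reference's "elle", so the metric scores 0.0,
# even though a human judge might accept the translation if the antecedent differs.
src, out, ref = ["it", "works"], ["il", "fonctionne"], ["elle", "fonctionne"]
print(pronoun_match_score(src, out, ref, {0: 0, 1: 1}, {0: 0, 1: 1}))  # -> 0.0
```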
“…Most of these perform evaluation based on a reference and do not take the context into account. There are also those that suggest using evaluation test sets or, better yet, combining them with semi-automatic evaluation schemes [24]. More recently, Stojanovski and Fraser [103] propose to use oracle experiments for evaluating the effect of pronoun resolution and coherence in MT.…”
Section: Discussion (mentioning)
confidence: 99%
“…Jwalapuram et al (2019) propose a model for pronoun translation evaluation trained on pairs of sentences consisting of the reference and a system output with differing pronouns. However, as Guillou and Hardmeier (2018) point out, this fails to take into account that often there is not a 1:1 correspondence between pronouns in different languages. As a result, a system translation may be correct despite not containing the exact pronoun in the reference, and incorrect even if containing the pronoun in the reference, because of differences in the translation of the referent.…”
Section: Coreference Resolution in Machine Translation (mentioning)
confidence: 99%
“…Alternatives to BLEU include F1, partial credit, and oracle-guided approaches (Hardmeier and Federico, 2010; Guillou and Hardmeier, 2016; Miculicich Werlen and Popescu-Belis, 2017). However, Guillou and Hardmeier (2018) show that these metrics can miss important cases and propose semi-automatic evaluation. In contrast, our evaluation is completely automatic.…”
Section: Coreference Resolution in Machine Translation (mentioning)
confidence: 99%