2010
DOI: 10.1007/s10590-010-9072-7

Metric and reference factors in minimum error rate training

Abstract: In Minimum Error Rate Training (MERT), Bleu is often used as the error function, despite the fact that it has been shown to have a lower correlation with human judgment than other metrics such as Meteor and Ter. In this paper, we present empirical results showing that parameters tuned on Bleu may lead to sub-optimal Bleu scores under certain data conditions. Such scores can be improved significantly by tuning on an entirely different metric, e.g. Meteor, by 0.0082 Bleu or 3.38% relative improvement on t…
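For context on where the error metric enters MERT: tuning re-ranks n-best candidate translations under trial feature weights and keeps the weights whose 1-best outputs maximise the chosen metric, so swapping Bleu for Meteor changes only the scoring step. Below is a minimal sketch of that loop; the toy unigram metric, the tiny n-best data, and the single-weight grid search are illustrative assumptions, not the paper's setup (real MERT performs Och's exact line search over all feature weights).

```python
# Minimal sketch of metric choice in MERT-style tuning (illustrative only).
# The metric, n-best list, and features below are hypothetical stand-ins.

def sentence_match(hyp, ref):
    """Toy stand-in for a metric such as Bleu or Meteor:
    unigram precision of the hypothesis against the reference."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    if not hyp_toks:
        return 0.0
    matches = sum(1 for t in hyp_toks if t in ref_toks)
    return matches / len(hyp_toks)

def tune_weight(nbest, refs, metric, grid=None):
    """Pick the weight whose re-ranked 1-best outputs maximise `metric`.
    nbest: per-sentence list of (hypothesis, model_score, length_feature)."""
    if grid is None:
        grid = [i / 10 for i in range(-10, 11)]  # candidate weights
    best_w, best_score = None, -1.0
    for w in grid:
        total = 0.0
        for cands, ref in zip(nbest, refs):
            # re-rank the n-best list under the trial weight
            top = max(cands, key=lambda c: c[1] + w * c[2])
            total += metric(top[0], ref)
        if total > best_score:
            best_w, best_score = w, total
    return best_w, best_score

# Usage: passing a different `metric` callable can change which weight wins,
# which is the effect the paper studies when tuning on Meteor instead of Bleu.
nbest = [[("the cat sat", 1.0, 3), ("a cat sat on the mat", 0.8, 6)]]
refs = ["the cat sat on the mat"]
print(tune_weight(nbest, refs, sentence_match))
```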

Cited by 5 publications (4 citation statements)
References 10 publications (11 reference statements)
“…This turns out to be largely due to the fact that the 4-g LM tuned weight for the labeled systems is always far lower than for Hiero, suggesting that the 4-g LM has a smaller contribution during tuning for BLEU. Tuning for BLEU is not guaranteed to give improved performance on all metrics, as noted by He and Way (2009), but we do see here improved performance for three out of four metrics.…”
Section: Primary Results: Soft Bilingual Constraints and Basic+Sparse
confidence: 57%
“…it seems that TER is penalizing longer output more heavily even when it is closer in length to the reference (cf. He and Way 2009). This turns out to be largely due to the fact that the 4-g LM tuned weight for the labeled systems is always far lower than for Hiero, suggesting that the 4-g LM has a smaller contribution during tuning for BLEU.…”
Section: Primary Results: Soft Bilingual Constraints and Basic+Sparse
confidence: 99%
“…In addition, we find that both DTU and our systems do not achieve consistent improvements over Treelet in terms of TER. We observed that both DTU and our systems tend to produce longer translations than Treelet, which might cause unreliable TER evaluation in our experiments as TER favours shorter sentences (He and Way, 2010).…”
Section: Results
confidence: 80%
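The length bias these statements point to can be made concrete: TER divides the number of edit operations by the reference length, so the surplus words of a longer hypothesis all count as edits, even when the hypothesis is no further from the reference length than a shorter alternative. A minimal sketch follows (word-level Levenshtein edits without TER's phrase shifts; the example sentences are made up):

```python
def ter_like(hyp, ref):
    """Edit-rate in the spirit of TER (omitting shift operations):
    word-level Levenshtein edits divided by reference length."""
    h, r = hyp.split(), ref.split()
    # classic dynamic-programming edit distance over tokens
    dp = list(range(len(h) + 1))          # D(i, 0) = i deletions
    for j in range(1, len(r) + 1):
        prev, dp[0] = dp[0], j            # prev holds D(i-1, j-1)
        for i in range(1, len(h) + 1):
            cur = dp[i]                   # D(i, j-1)
            dp[i] = min(dp[i - 1] + 1,    # insertion
                        dp[i] + 1,        # deletion
                        prev + (h[i - 1] != r[j - 1]))  # substitution
            prev = cur
    return dp[len(h)] / len(r)

ref = "the cat sat on the mat"
short = "cat sat"                                   # 2 tokens vs 6
long_ = "a small cat was sitting quietly on a mat"  # 9 tokens vs 6
# The longer hypothesis scores worse (1.00) than the much shorter
# one (~0.67) despite being closer to the reference length.
print(ter_like(short, ref), ter_like(long_, ref))
```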
“…Our work can be seen as replacing the regular BLEU metric with a new paraphrase BLEU metric for system tuning. Different alternative automatic evaluation metrics have also been considered for system tuning (He and Way, 2010; Servan and Schwenk, 2011) with Minimum Error Rate Training, MERT (Och, 2003). This work showed some specific cases where Translation Error Rate (TER) was superior to BLEU.…”
Section: Related Work
confidence: 99%