Proceedings of the Second Conference on Machine Translation 2017
DOI: 10.18653/v1/w17-4755

Results of the WMT17 Metrics Shared Task

Abstract: This paper presents the results of the WMT17 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT17 news translation task and Neural MT training task. We collected scores of 14 metrics from 8 research groups. In addition to that, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with WM…
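The system-level evaluation described in the abstract boils down to correlating one metric score per MT system with one human score per system. A minimal sketch of that computation, using Pearson's r from SciPy with made-up per-system scores (the values and score types are illustrative assumptions, not data from the task):

```python
# Minimal sketch of system-level correlation, as used in the WMT metrics task:
# each MT system gets one metric score and one human score, and the two lists
# are correlated with Pearson's r. The numbers below are made up for illustration.
from scipy.stats import pearsonr

# Hypothetical per-system scores (one value per MT system).
metric_scores = [27.4, 31.2, 24.8, 29.9, 26.1]   # e.g. BLEU
human_scores = [0.12, 0.35, -0.20, 0.28, 0.05]   # e.g. standardized human judgments

r, p_value = pearsonr(metric_scores, human_scores)
print(f"system-level Pearson r = {r:.3f} (p = {p_value:.3f})")
```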

Cited by 103 publications (112 citation statements) · References 24 publications
“…The common practice in MT research is to evaluate the model performance on a test set against one or more human reference translations. The most widespread automatic metric is undoubtedly the BLEU score (Papineni et al., 2002), despite its acknowledged problems and better-performing alternatives (Bojar et al., 2017b). For simplicity, we stick to BLEU, too (we evaluated all our results also with chrF (Popović, 2015), but found no substantial differences from BLEU).…”
mentioning
confidence: 96%
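The snippet above reports results with BLEU and chrF. A minimal sketch of how both corpus-level scores could be computed with the sacrebleu package (the package choice and the toy sentences are assumptions, not the cited authors' setup):

```python
# Minimal sketch of scoring a test set with BLEU and chrF, here via the
# sacrebleu package, 2.x interface (an assumption; the cited work may use other tooling).
import sacrebleu

hypotheses = ["the cat sat on the mat", "a dog barked loudly"]          # MT outputs
references = [["the cat sat on the mat", "a dog was barking loudly"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}  chrF = {chrf.score:.2f}")
```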
“…Throughout the paper, we report BLEU (Papineni et al., 2002) and chrF++ (Popović, 2017) scores. The latter is known to correlate better than BLEU with human judgements when the target language (TL) is highly inflected (Bojar et al., 2017), as is the case. Where reported, we assess whether differences between systems' outputs are statistically significant for p < 0.05 with 1 000 iterations of paired bootstrap resampling (Koehn, 2004).…”
Section: Data Preparation and Training Details
mentioning
confidence: 99%
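The snippet above tests significance with 1,000 iterations of paired bootstrap resampling (Koehn, 2004). A simplified sketch of the procedure, comparing mean sentence-level scores of two systems over resampled test sets; the score lists are hypothetical, and a full implementation would recompute the corpus-level metric on each resample:

```python
# Minimal sketch of paired bootstrap resampling (Koehn, 2004). Here we
# resample sentence indices and compare mean sentence-level scores; a full
# implementation would recompute the corpus-level metric on each resample.
# The score lists below are hypothetical.
import random

def paired_bootstrap(scores_a, scores_b, n_iters=1000, seed=0):
    """Return the fraction of resamples in which system A outscores system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_iters):
        idx = [rng.randrange(n) for _ in range(n)]   # sample sentences with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_iters

# Hypothetical sentence-level scores for two systems on the same test set.
sys_a = [0.31, 0.42, 0.28, 0.55, 0.47, 0.39]
sys_b = [0.29, 0.40, 0.30, 0.50, 0.45, 0.36]
p_a_better = paired_bootstrap(sys_a, sys_b)
print(f"A > B in {p_a_better:.1%} of resamples")  # > 95% would suggest p < 0.05
```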
“…ENTFp (Yu et al., 2015a) evaluates the fluency of an MT hypothesis. After the success of DPMFcomb, Blend (Ma et al., 2017) achieved the best performance in the WMT-2017 Metrics task (Bojar et al., 2017). Similar to DPMFcomb, Blend is essentially an SVR (RBF kernel) model that uses the scores of various metrics as features.…”
Section: Related Work
mentioning
confidence: 99%
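Blend is described above as an SVR with an RBF kernel over the scores of several metrics. A hedged sketch of that kind of combination using scikit-learn, with made-up feature and target values; this illustrates the general technique, not the actual Blend implementation:

```python
# Sketch of a Blend-style combination: an SVR with RBF kernel that maps
# several metric scores (features) to a human judgment (target).
# Uses scikit-learn; the training data below is made up for illustration.
import numpy as np
from sklearn.svm import SVR

# Each row: [sentBLEU, chrF, TER] for one MT output segment (hypothetical values).
X_train = np.array([
    [0.32, 0.55, 0.61],
    [0.18, 0.40, 0.75],
    [0.45, 0.63, 0.50],
    [0.27, 0.48, 0.68],
])
# Target: human judgment score for the same segments (hypothetical).
y_train = np.array([0.20, -0.35, 0.55, -0.05])

model = SVR(kernel="rbf")
model.fit(X_train, y_train)

X_new = np.array([[0.30, 0.52, 0.64]])
print("predicted human score:", model.predict(X_new)[0])
```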
“…Various MTE metrics have been proposed in the metrics task of the Workshops on Statistical Machine Translation (WMT), which started in 2008. However, most MTE metrics are obtained by computing the similarity between an MT hypothesis and a reference translation based on character N-grams or word N-grams, such as SentBLEU (Lin and Och, 2004), which is a smoothed version of BLEU (Papineni et al., 2002), Blend (Ma et al., 2017), MEANT 2.0 (Lo, 2017), and chrF++ (Popović, 2017), which achieved excellent results in the WMT-2017 Metrics task (Bojar et al., 2017). Therefore, they can exploit only limited information for segment-level MTE.…”
Section: Introduction
mentioning
confidence: 99%
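The snippet above notes that most metrics compare an MT hypothesis to a reference via character or word N-grams. A simplified character n-gram F-score in the spirit of chrF follows (single n, plain F1; the official chrF averages over n = 1–6 and weights recall with a beta parameter):

```python
# Simplified character n-gram F-score in the spirit of chrF (Popović, 2015/2017).
# Real chrF averages precision/recall over n = 1..6 and weights recall with beta;
# this sketch uses a single n and a plain F1 for clarity.
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")          # ignore spaces, as chrF does by default
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_fscore(hypothesis, reference, n=3):
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())   # clipped n-gram matches
    if not hyp or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(char_fscore("the cat sat on the mat", "the cat is on the mat"))
```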