Proceedings of the Tenth Workshop on Statistical Machine Translation 2015
DOI: 10.18653/v1/w15-3031

Results of the WMT15 Metrics Shared Task

Abstract: This paper presents the results of the WMT15 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT15 Shared Translation Task. We collected scores of 46 metrics from 11 research groups. In addition to that, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT15 official human scores) and in terms of segment-level correlation…
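The abstract's system-level evaluation is a correlation between metric scores and human scores over whole systems, while its segment-level evaluation asks how often a metric agrees with humans when comparing two translations of the same sentence. The sketch below illustrates that pairwise-agreement idea with a Kendall's-tau-style count; the function name, data layout, and tie handling are illustrative assumptions, not the exact WMT15 formulation.

```python
# Sketch of a Kendall's-tau-style segment-level agreement score:
# how often a metric prefers the same translation as the human judge.
# Data layout and tie handling are illustrative, not the WMT15 definition.

def segment_level_tau(pairs):
    """pairs: list of (human_prefers_a, metric_score_a, metric_score_b),
    where human_prefers_a is True if humans ranked translation A above B."""
    concordant = discordant = 0
    for human_prefers_a, score_a, score_b in pairs:
        if score_a == score_b:
            continue                      # ties are handled differently per year
        metric_prefers_a = score_a > score_b
        if metric_prefers_a == human_prefers_a:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical judgements: (human preference, metric score A, metric score B).
judgements = [(True, 0.71, 0.64), (False, 0.52, 0.55), (True, 0.40, 0.48)]
print(segment_level_tau(judgements))      # 2 concordant, 1 discordant -> 0.333...
```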

Cited by 65 publications (63 citation statements) | References 18 publications
“…A main venue for the evaluation of MT metrics is the annual Workshop on Statistical Machine Translation (WMT), where large-scale human evaluation takes place, primarily to rank the systems competing in the translation shared task, but additionally to use the resulting system rankings for the evaluation of automatic metrics. Since 2014, WMT has used the Pearson correlation as the official measure for the evaluation of metrics (Macháček and Bojar, 2014; Stanojević et al., 2015). Comparing the performance of any two metrics therefore amounts to comparing two Pearson correlation point estimates computed over a sample of MT systems.…”
Section: Introduction (mentioning)
Confidence: 99%
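As a concrete illustration of the system-level setup this excerpt describes, the sketch below computes one Pearson correlation point estimate per metric against human scores over a small sample of systems; the system scores and metric labels are made-up placeholders, not WMT data.

```python
# Minimal sketch: system-level Pearson correlation of metric scores with
# human scores, one point estimate per metric over a sample of MT systems.
# All numbers below are illustrative placeholders.
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human    = [0.12, -0.05, 0.30, -0.22, 0.41]   # official human scores (placeholder)
metric_a = [31.2, 28.9, 33.5, 27.1, 34.8]     # hypothetical metric A scores
metric_b = [52.0, 50.1, 55.3, 48.7, 56.9]     # hypothetical metric B scores

# Comparing two metrics means comparing their two point estimates.
print(f"metric A: r = {pearson(metric_a, human):.3f}")
print(f"metric B: r = {pearson(metric_b, human):.3f}")
```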
“…We see in Tables 1 and 2 that our models that use images directly to initialise either the encoder or the decoder are the only ones to consistently outperform the PBSMT baseline according to the chrF3 metric, a character-based metric that includes both precision and recall, and has a recall bias. That is also a noteworthy finding, since chrF3 is the only character-level metric we use, and it has shown a high correlation with human judgements (Stanojević et al., 2015).…”
Section: Results (mentioning)
Confidence: 55%
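For readers unfamiliar with chrF3, the rough sketch below (not the reference implementation) shows the character n-gram F-score idea with β = 3, which is what gives the metric its recall bias; the whitespace handling and the n-gram range are simplifying assumptions.

```python
# Rough chrF-style score with beta = 3 (recall weighted three times as
# heavily as precision), averaged over character n-grams of order 1..6.
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")          # simplification: ignore whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())       # clipped n-gram matches
        if hyp and ref:
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(chrf("the cat sat on the mat", "the cat is on the mat"))
```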
“…The evaluation metrics are correlated with human rankings by means of Spearman's rank correlation coefficient for the WMT13 task (Macháček and Bojar, 2013) and the Pearson product-moment correlation coefficient for the WMT14 task (Macháček and Bojar, 2014) and the WMT15 task (Stanojević et al., 2015), at the system level. Through the experiments we aim to investigate the following points:…”
Section: Methods (mentioning)
Confidence: 99%
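To make the contrast between the two correlation coefficients concrete, the toy example below uses made-up system-level scores in which a metric ranks the systems exactly as the humans do, but not linearly: Spearman's rank correlation is then 1.0 while the Pearson correlation stays below 1.0.

```python
# Toy contrast between Spearman's rank correlation and Pearson's
# product-moment correlation at the system level (made-up scores).
from scipy.stats import pearsonr, spearmanr

human_scores  = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical human scores
metric_scores = [1.0, 2.0, 4.0, 8.0, 16.0]    # monotone but non-linear metric

r, _ = pearsonr(metric_scores, human_scores)      # linear association < 1.0
rho, _ = spearmanr(metric_scores, human_scores)   # rank association = 1.0

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```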
“…The best results in each direction are in bold. We calculated the CharacTER and CHRF3 scores ourselves and took the other scores from the WMT metrics papers (Macháček and Bojar, 2013; Macháček and Bojar, 2014; Stanojević et al., 2015). * English→German scores are not included in the averages of the WMT14 metrics task.…”
Section: Comparison With Other Metrics (mentioning)
Confidence: 99%