Abstract

This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017). We examine some of the critical parameters that affect the final translation quality, memory usage, training stability and training time, concluding each experiment with a set of recommendations for fellow researchers. In addition to confirming the general mantra "more data and larger models", we address scaling to multiple GPUs and provide practical tips for improved training regarding batch size, learning rate, warmup steps, maximum sentence length and checkpoint averaging. We hope that our observations will allow others to get better results given their particular hardware and data constraints.

1 Introduction

In this article, we experiment with a relatively new NMT model called Transformer (Vaswani et al., 2017), as implemented in the Tensor2Tensor (abbreviated T2T) toolkit, version 1.2.9. The model and the toolkit were released shortly after the WMT2017 evaluation campaign, and their behavior on large-data news translation is not yet fully explored. We empirically explore some of the important hyper-parameters, hoping that our observations will also be useful for other researchers considering this model and framework.

While investigations into the effect of hyper-parameters such as learning rate and batch size are available in the deep-learning community (e.g., Bottou et al., 2016; Smith and Le, 2017; Jastrzebski et al., 2017), they are either mostly theoretical or supported by experiments in domains such as image recognition rather than machine translation. In this article, we fill the gap by focusing exclusively on MT and on the Transformer model, aiming to provide best practices for this particular setting.

Some of our observations confirm the general wisdom (e.g. that larger training data are generally better) and quantify this behavior on English-to-Czech translation experiments. Other observations are somewhat surprising, e.g. that two GPUs are more than three times faster than a single GPU, or our findings about the interaction between maximum sentence length, learning rate and batch size.

The article is structured as follows. In Section 2, we discuss our evaluation methodology and main criteria: translation quality and speed of training. Section 3 describes our dataset and its preparation. Section 4 is the main contribution of the article: a set of commented experiments, each with a set of recommendations. Finally, Section 5 compares our best Transformer run with systems participating in WMT17. We conclude in Section 6.
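Since learning rate and warmup steps recur throughout the experiments, we recall for concreteness the learning-rate schedule introduced with the Transformer (Vaswani et al., 2017): linear warmup followed by inverse-square-root decay, the default in T2T's Transformer hyper-parameter sets. The sketch below is only an illustration; the function name is ours, and the default values of d_model and warmup_steps are the ones from Vaswani et al. (2017), not a recommendation of this article.

    def transformer_lr(step, d_model=512, warmup_steps=4000):
        # Schedule from Vaswani et al. (2017): the rate grows linearly
        # for the first warmup_steps updates, peaks, and then decays
        # with the inverse square root of the step number.
        step = max(step, 1)  # guard against division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)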
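Checkpoint averaging, also mentioned above, replaces the final weights with the element-wise mean of the variables stored in the last few training checkpoints; T2T ships a utility for this (avg_checkpoints.py). The sketch below merely illustrates the idea using TensorFlow's checkpoint-reading API; the function name and the handling of non-float variables are our own choices, and writing the result back to disk is omitted.

    import numpy as np
    import tensorflow as tf

    def average_checkpoints(checkpoint_paths):
        # Element-wise mean of the variables stored in several checkpoints.
        names = [name for name, _ in tf.train.list_variables(checkpoint_paths[0])]
        averaged = {}
        for name in names:
            values = [tf.train.load_variable(path, name) for path in checkpoint_paths]
            if np.issubdtype(values[0].dtype, np.floating):
                averaged[name] = np.mean(values, axis=0)
            else:
                # Integer variables such as the global step cannot be
                # meaningfully averaged; keep the most recent value.
                averaged[name] = values[-1]
        return averaged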
2 Evaluation Methodology

Machine translation can be evaluated in many ways, and some form of human judgment should always be used as the ultimate criterion in any final application. The common practice in MT research is to evaluate model performance on a test set against one or more human reference translations. The most widespread automatic metric is undoubtedly the BLEU score (Papineni et al., 2002), de...
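For illustration, corpus-level BLEU can be computed with any of several implementations; the snippet below uses the sacrebleu package, which is our choice for this example rather than a tool prescribed by the text, and the toy sentences are invented.

    import sacrebleu  # pip install sacrebleu

    hypotheses = ["the cat sat on the mat"]   # system outputs, one per segment
    references = [["the cat is on the mat"]]  # one stream per reference translation

    # corpus_bleu aggregates n-gram statistics over the whole test set
    # before computing the score, as Papineni et al. (2002) define it.
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")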