This paper describes the submission of RWTH Aachen University for the De→En parallel corpus filtering task of the EMNLP 2018 Third Conference on Machine Translation (WMT 2018). We use several rule-based, heuristic methods to preselect sentence pairs. The preselected pairs are then scored with count-based and neural language and translation models. In addition to scoring single sentence pairs, we implement a simple redundancy-removal heuristic. Our best-performing corpus filtering system relies on recurrent neural language models and translation models based on the transformer architecture. A model trained on 10M randomly sampled tokens reaches 9.2% BLEU on newstest2018. Using our filtering and ranking techniques, we achieve 34.8% BLEU.
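The abstract does not list the specific rules or the redundancy heuristic, so the following is only a minimal sketch of what such a pipeline could look like. The length limits, the length-ratio threshold, the copy check, and the deduplication on the normalized target side are all illustrative assumptions, and the function names `preselect` and `remove_redundant` are hypothetical rather than taken from the submission.

```python
def preselect(pairs, min_len=1, max_len=100, max_ratio=3.0):
    """Keep (source, target) pairs that pass simple rule-based checks.

    All thresholds are assumed values for illustration only.
    """
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Rule 1: both sides must have a plausible length in tokens.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Rule 2: the two sides must not differ too much in length.
        if max(n_src, n_tgt) / max(1, min(n_src, n_tgt)) > max_ratio:
            continue
        # Rule 3: drop untranslated copies of the source.
        if src.strip() == tgt.strip():
            continue
        kept.append((src, tgt))
    return kept


def remove_redundant(scored_pairs):
    """Rank (score, source, target) triples by score, highest first, and
    keep only the best-scoring pair for each normalized target sentence."""
    seen = set()
    result = []
    for score, src, tgt in sorted(scored_pairs, key=lambda t: t[0], reverse=True):
        key = " ".join(tgt.lower().split())  # case- and whitespace-insensitive
        if key in seen:
            continue
        seen.add(key)
        result.append((score, src, tgt))
    return result
```

In this sketch the scores passed to `remove_redundant` would come from the language and translation models mentioned above; how those scores are combined is not specified here.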
Back-translation (data augmentation by translating target monolingual data) is a crucial component in modern neural machine translation (NMT). In this work, we reformulate back-translation in the scope of cross-entropy optimization of an NMT model, clarifying its underlying mathematical assumptions and approximations beyond its heuristic usage. Our formulation covers broader synthetic data generation schemes, including sampling from a target-to-source NMT model. With this formulation, we point out fundamental problems of the sampling-based approaches and propose to remedy them by (i) disabling label smoothing for the target-to-source model and (ii) sampling from a restricted search space. Our statements are investigated on the WMT 2018 German ↔ English news translation task.
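To illustrate what sampling from a restricted search space can mean at a single decoding step, the sketch below draws the next token only from the k most probable entries of a distribution. Top-k truncation is an assumption chosen for illustration; the abstract does not state how the search space is restricted, and the function name `sample_top_k` is hypothetical.

```python
import random


def sample_top_k(probs, k, rng=random):
    """Sample an index from the k most probable entries of `probs`.

    Entries outside the top k get zero probability; the remaining
    weights are renormalized implicitly by `choices`.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy next-token distribution over a four-word vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
samples = [sample_top_k(probs, k=2, rng=rng) for _ in range(1000)]
```

With `k=2`, every sample is index 0 or 1, so the low-probability tail is never generated. With `k=1` the procedure reduces to greedy selection.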
This paper describes the statistical machine translation system developed at RWTH Aachen University for the English→German and German→English translation tasks of the EMNLP 2017 Second Conference on Machine Translation (WMT 2017). We use ensembles of attention-based neural machine translation systems for both directions. We use the provided parallel and synthetic data to train the models. In addition, we create a phrasal system using joint translation and reordering models in decoding and neural models in rescoring.