Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2018
DOI: 10.18653/v1/w18-6487
The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task

Abstract: This paper describes the submission of RWTH Aachen University for the De→En parallel corpus filtering task of the EMNLP 2018 Third Conference on Machine Translation (WMT 2018). We use several rule-based, heuristic methods to preselect sentence pairs. These sentence pairs are scored with count-based and neural systems as language and translation models. In addition to single sentence-pair scoring, we further implement a simple redundancy-removing heuristic. Our best performing corpus filtering system relies on …
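The abstract mentions rule-based, heuristic preselection of sentence pairs before model-based scoring. As a minimal sketch of what such preselection could look like, the rules and thresholds below (a token-length cap and a length-ratio bound) are illustrative assumptions, not the paper's exact settings:

```python
def heuristic_prefilter(src: str, tgt: str,
                        max_len: int = 100,
                        max_ratio: float = 3.0) -> bool:
    """Keep a sentence pair only if both sides are non-empty, not overly
    long, and their token-length ratio is plausible for a translation.
    Thresholds are illustrative, not the paper's settings."""
    s, t = src.split(), tgt.split()
    if not s or not t or len(s) > max_len or len(t) > max_len:
        return False
    # A genuine translation pair rarely differs in length by more than ~3x.
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio
```

Pairs failing such cheap checks can be discarded before the more expensive language- and translation-model scoring is applied.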

Cited by 20 publications (25 citation statements)
References 9 publications
“…Pivot-based sentence embedding (Schwenk and Douze, 2017) improves upon random sampling, but it has an impractical data condition. The four-model combination of NMT models and LMs (Rossenbach et al., 2018) provides 1-3% more BLEU improvement. Note that, for the third method, each model costs 1-2 weeks to train.…”
Section: Baselines
confidence: 98%
“…The task is to score each line of a very noisy, web-crawled corpus of 104M parallel lines (ParaCrawl English-German). We pre-filtered the given raw corpus with the heuristics of Rossenbach et al. (2018). Only the data for the WMT 2018 English-German news translation task is allowed to train scoring models.…”
Section: Data
confidence: 99%
“…As a result, the MT field faces various data quality issues such as misalignment and incorrect translations, which may significantly impact translation quality. A straightforward solution is to apply a filtering approach, where noisy data are filtered out and a smaller subset of high-quality sentence pairs is retained (Bei et al., 2018; Junczys-Dowmunt, 2018; Rossenbach et al., 2018). Nevertheless, it is unclear whether such a filtering approach can be successfully applied to GEC, where commonly available datasets tend to be far smaller than those used in recent neural MT research.…”
Section: Related Work
confidence: 99%
“…We evaluated the effectiveness of our method over several GEC datasets, and found that it considerably outperformed baseline methods, including three strong denoising baselines based on a filtering approach, which is a common approach in MT (Bei et al., 2018; Junczys-Dowmunt, 2018; Rossenbach et al., 2018). We further improved the performance by applying task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks.…”
Section: Introduction
confidence: 96%
“…It achieves 31.3% BLEU and 29.9% BLEU on the En→De task on newstest2015 and newstest2017, respectively. To filter out sentence pairs that were copied instead of translated by the system, we apply a filtering method based on the Levenshtein distance between source and target sentences (Rossenbach et al., 2018). This further reduced the synthetic corpus size to 15.9M sentence pairs, which are used to train our final systems.…”
Section: German→English
confidence: 99%
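The last citation statement describes filtering out near-copies via the Levenshtein distance between source and target. A minimal sketch of that idea follows; the normalization by the longer sentence's length and the 0.25 threshold are illustrative assumptions, not values from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_copy(src: str, tgt: str, min_ratio: float = 0.25) -> bool:
    """Flag a pair whose target is a near-copy of the source: the edit
    distance is small relative to the longer sentence. The 0.25 threshold
    is an illustrative assumption."""
    denom = max(len(src), len(tgt)) or 1
    return levenshtein(src, tgt) / denom < min_ratio
```

Pairs flagged by such a check would be dropped, since a target that barely differs from its source was likely copied through rather than translated.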