Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2018
DOI: 10.18653/v1/w18-6487
The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task

Abstract: This paper describes the submission of RWTH Aachen University for the De→En parallel corpus filtering task of the EMNLP 2018 Third Conference on Machine Translation (WMT 2018). We use several rule-based, heuristic methods to preselect sentence pairs. These sentence pairs are scored with count-based and neural systems as language and translation models. In addition to single sentence-pair scoring, we further implement a simple redundancy-removing heuristic. Our best performing corpus filtering system relies on …
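The abstract mentions rule-based, heuristic preselection of sentence pairs before model-based scoring. As a minimal sketch of what such preselection could look like, the rules and thresholds below (a token-length cap and a length-ratio bound) are illustrative assumptions, not the paper's exact settings:

```python
def heuristic_prefilter(src: str, tgt: str,
                        max_len: int = 100,
                        max_ratio: float = 3.0) -> bool:
    """Keep a sentence pair only if both sides are non-empty, not overly
    long, and their token-length ratio is plausible for a translation.
    Thresholds are illustrative, not the paper's settings."""
    s, t = src.split(), tgt.split()
    if not s or not t or len(s) > max_len or len(t) > max_len:
        return False
    # A genuine translation pair rarely differs in length by more than ~3x.
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio
```

Pairs failing such cheap checks can be discarded before the more expensive language- and translation-model scoring is applied.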

Cited by 20 publications (25 citation statements)
References 9 publications
“…Pivot-based sentence embedding (Schwenk and Douze, 2017) improves upon random sampling, but it has an impractical data condition. The four-model combination of NMT models and LMs (Rossenbach et al., 2018) provides 1-3% more BLEU improvement. Note that, for the third method, each model costs 1-2 weeks to train.…”
Section: Baselines
confidence: 98%
“…The task is to score each line of a very noisy, web-crawled corpus of 104M parallel lines (ParaCrawl English-German). We pre-filtered the given raw corpus with the heuristics of Rossenbach et al. (2018). Only the data for the WMT 2018 English-German news translation task is allowed to train scoring models.…”
Section: Data
confidence: 99%
“…As a result, the MT field faces various data quality issues such as misalignment and incorrect translations, which may significantly impact translation quality. A straightforward solution is to apply a filtering approach, where noisy data are filtered out and a smaller subset of high-quality sentence pairs is retained (Bei et al., 2018; Junczys-Dowmunt, 2018; Rossenbach et al., 2018). Nevertheless, it is unclear whether such a filtering approach can be successfully applied to GEC, where commonly available datasets tend to be far smaller than those used in recent neural MT research.…”
Section: Related Work
confidence: 99%
“…We evaluated the effectiveness of our method over several GEC datasets, and found that it considerably outperformed baseline methods, including three strong denoising baselines based on a filtering approach, which is a common approach in MT (Bei et al., 2018; Junczys-Dowmunt, 2018; Rossenbach et al., 2018). We further improved the performance by applying task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks.…”
Section: Introduction
confidence: 96%
“…It achieves 31.3% BLEU and 29.9% BLEU on the En→De task on newstest2015 and newstest2017, respectively. To filter out sentence pairs that were copied instead of translated by the system, we apply a filtering method based on the Levenshtein distance between source and target sentences (Rossenbach et al., 2018). This further reduced the synthetic corpus size to 15.9M sentence pairs, which are used to train our final systems.…”
Section: German→English
confidence: 99%
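The last citation statement describes filtering out near-copies via the Levenshtein distance between source and target. A minimal sketch of that idea follows; the normalization by the longer sentence's length and the 0.25 threshold are illustrative assumptions, not values from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_copy(src: str, tgt: str, min_ratio: float = 0.25) -> bool:
    """Flag a pair whose target is a near-copy of the source: the edit
    distance is small relative to the longer sentence. The 0.25 threshold
    is an illustrative assumption."""
    denom = max(len(src), len(tgt)) or 1
    return levenshtein(src, tgt) / denom < min_ratio
```

Pairs flagged by such a check would be dropped, since a target that barely differs from its source was likely copied through rather than translated.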