Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2) 2019
DOI: 10.18653/v1/w19-5439

Filtering of Noisy Parallel Corpora Based on Hypothesis Generation

Abstract: The noisy parallel corpora filtering task at WMT2019 challenges participants to create filtering methods useful for training machine translation systems. In this work, we introduce a noisy parallel corpora filtering system based on generating hypotheses by means of a translation model. We train translation models for both language pairs, Nepali-English and Sinhala-English, using the provided parallel corpora. To create the best possible translation model, we first join all provided parallel corpora (…)

Cited by 3 publications (4 citation statements). References 9 publications (7 reference statements).
“…Another approach is to first train a translation system on the clean data, then use it to translate the non-English side into English and use monolingual matching methods to compare it against the English side of the parallel corpus. Different matching metrics were used: METEOR (Erdmann and Gwinnup, 2019), Levenshtein distance (Sen et al., 2019), or BLEU (Parcheta et al., 2019). Several submissions considered vocabulary coverage in their methods, preferring to add sentence pairs to the limited set that increase the number of words and n-grams covered (Erdmann and Gwinnup, 2019; Bernier-Colborne and Lo, 2019; González-Rubio, 2019).…”
Section: Methods Used By Participants (mentioning)
confidence: 99%
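The translate-then-match idea can be sketched in plain Python. The snippet below implements token-level Levenshtein distance (one of the matching metrics cited above) and a normalized similarity score; the MT model that produces the hypothesis translation is assumed to exist elsewhere and is only mentioned in a comment, so this is a minimal sketch, not any participant's actual implementation.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences, computed with the
    classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ta != tb)))   # substitution
        prev = cur
    return prev[-1]

def match_score(hypothesis, reference):
    """Normalized similarity in [0, 1]; 1.0 means the machine
    translation of the source exactly matches the given target side."""
    h, r = hypothesis.split(), reference.split()
    if not h and not r:
        return 1.0
    return 1.0 - levenshtein(h, r) / max(len(h), len(r))

# Hypothetical usage: `translate(src)` stands in for the MT system
# trained on clean data; pairs whose match_score falls below a chosen
# threshold would be filtered out of the noisy corpus.
```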
“…building neural machine translation models on the clean data and considering the translation scores from forced translation of the parallel corpus. One submission used this method, while others applied the same idea to monolingual language model scores (Axelrod, 2019; Parcheta et al., 2019).…”
Section: Methods Used By Participants (mentioning)
confidence: 99%
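The monolingual language-model variant mentioned above can be illustrated with a tiny bigram model: train counts on clean text, then score each candidate sentence by its mean log-probability, so disfluent or misaligned text receives a markedly lower score. This is a toy sketch with add-one smoothing, not the participants' actual models.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Collect unigram and bigram counts, with <s>/</s> boundary
    markers, from a clean monolingual corpus."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def lm_score(sentence, unigrams, bigrams):
    """Mean log-probability under the add-one-smoothed bigram LM.
    Scores closer to zero suggest fluent text; very low scores flag
    likely noise."""
    vocab = len(unigrams)
    toks = ["<s>"] + sentence.split() + ["</s>"]
    logp = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
               for a, b in zip(toks, toks[1:]))
    return logp / (len(toks) - 1)
```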
“…Another approach is to first train a translation system on the clean data, then use it to translate the non-English side into English and use monolingual matching methods to compare it against the English side of the parallel corpus. Different matching metrics were used: METEOR (Erdmann and Gwinnup, 2019), Levenshtein distance (Sen et al., 2019), or BLEU (Parcheta et al., 2019).…”
Section: Sentence Pair Filtering (mentioning)
confidence: 99%
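Since BLEU is the metric used by the paper under discussion (Parcheta et al., 2019), a sentence-level BLEU sketch is useful for intuition. The version below applies add-one smoothing to the n-gram precisions, because plain BLEU collapses to zero whenever any n-gram order has no match; the exact smoothing and tokenization used by the authors are not specified here, so treat this as an illustrative approximation.

```python
import math
from collections import Counter

def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with add-one-smoothed modified n-gram
    precisions and the standard brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: each hypothesis n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)
```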
“…Chaudhary et al. (2019) propose the use of cross-lingual sentence embeddings for determining sentence pair quality, while several efforts (Kurfalı and Östling, 2019; Soares and Costa-jussà, 2019; Bernier-Colborne and Lo, 2019) have focused on the use of monolingual word embeddings. Parcheta et al. (2019) use a machine translation system trained on clean data to translate the source sentences of the noisy corpus and evaluate the translation against the original target sentences using BLEU scores. Erdmann and Gwinnup (2019) and Sen et al. (2019) propose similar methods using METEOR scores and Levenshtein distance, respectively.…”
Section: Related Work (mentioning)
confidence: 99%
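The embedding-based filtering line of work above can be sketched with mean-pooled word vectors and cosine similarity. The `emb` dictionary is a hypothetical stand-in for a real cross-lingual embedding table (mapping words of both languages into one shared space); the actual systems cited use learned sentence encoders, so this is only a minimal illustration of the scoring step.

```python
import math

def sentence_vec(tokens, emb):
    """Mean of word vectors; out-of-vocabulary tokens are skipped.
    `emb` is assumed to map words of either language into a shared space."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pair_similarity(src, tgt, emb):
    """Similarity of a sentence pair; low values suggest misalignment
    and mark the pair as a candidate for filtering."""
    u = sentence_vec(src.split(), emb)
    v = sentence_vec(tgt.split(), emb)
    return cosine(u, v) if u and v else 0.0
```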