Filtering of Noisy Parallel Corpora Based on Hypothesis Generation

Parcheta, Zuzanna; Sanchis-Trilles, Germán; Casacuberta, Francisco

doi:10.18653/v1/w19-5439

Cited by 3 publications

(4 citation statements)

References 9 publications

(7 reference statements)

“…Another approach is to first train a translation system on the clean data, then use it to translate the non-English side into English and use monolingual matching methods to compare it against the English side of the parallel corpus. Different matching metrics were used: METEOR (Erdmann and Gwinnup, 2019), Levenshtein distance (Sen et al, 2019), or BLEU (Parcheta et al, 2019). Several submissions considered vocabulary coverage in their methods, preferring to add sentence pairs to the limited set that increase the number of words and n-grams covered (Erdmann and Gwinnup, 2019; Bernier-Colborne and Lo, 2019; González-Rubio, 2019).…”

Section: Methods Used By Participants (mentioning)

confidence: 99%

Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions

Koehn, Guzmán, Chaudhary et al. 2019

Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Following the WMT 2018 Shared Task on Parallel Corpus Filtering (Koehn et al., 2018), we posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting 2% and 10% of the highest-quality data to be used to train machine translation systems. This year, the task tackled the low resource condition of Nepali-English and Sinhala-English. Eleven participants from companies, national research labs, and universities participated in this task.
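The hypothesis-based scoring described in the citation statement above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used by any participant: `translate` stands in for a translation system trained on clean data, the BLEU variant is a simplified smoothed sentence-level score on whitespace tokens, and selection keeps a fraction of sentence pairs rather than reproducing the shared task's exact sub-sampling. The names `sentence_bleu`, `filter_corpus`, `toy_translate` and the toy data are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU (assumption: whitespace tokens,
    add-one smoothing for n > 1, orders longer than the hypothesis skipped)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        total = sum(hyp_counts.values())
        if total == 0:
            break  # hypothesis has fewer than n tokens
        overlap = sum((hyp_counts & ngrams(ref, n)).values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words at all
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_precisions.append(math.log(precision))
    # Brevity penalty: penalise hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(sum(log_precisions) / len(log_precisions))


def filter_corpus(
    pairs: List[Tuple[str, str]],
    translate: Callable[[str], str],
    keep_fraction: float,
) -> List[Tuple[float, str, str]]:
    """Score each (source, english) pair by comparing the hypothesis for the
    source side with the English side, then keep the best-scoring fraction."""
    scored = [(sentence_bleu(translate(src), tgt), src, tgt) for src, tgt in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]


# Toy stand-in for a translation system: a fixed lookup table (hypothetical).
TOY_HYPOTHESES = {
    "src-1": "the cat sat on the mat",
    "src-2": "buy cheap watches now",
}


def toy_translate(source: str) -> str:
    return TOY_HYPOTHESES[source]


toy_pairs = [
    ("src-1", "the cat sat on the mat"),   # English side matches the hypothesis
    ("src-2", "click here to subscribe"),  # English side is unrelated noise
]
kept = filter_corpus(toy_pairs, toy_translate, keep_fraction=0.5)
```

With the toy data, the pair whose English side matches the generated hypothesis is ranked first and the mismatched pair is dropped. Swapping `sentence_bleu` for another monolingual matching function (for example an edit-distance similarity) would leave `filter_corpus` unchanged.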

“…building neural machine translation models on the clean data and considering the translation scores from forced translation of the parallel corpus. One submission used this method, while others applied the same idea to monolingual language model scores (Axelrod, 2019; Parcheta et al, 2019).…”

Section: Methods Used By Participants (mentioning)

confidence: 99%

Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions

Koehn, Guzmán, Chaudhary et al. 2019

Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Section: Sentence Pair Filtering (mentioning)

confidence: 99%

ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Bañón, Chen, Haddow et al. 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

“…Chaudhary et al (2019) propose the use of cross-lingual sentence embeddings for determining sentence pair quality while several efforts (Kurfalı and Östling, 2019; Soares and Costa-jussà, 2019; Bernier-Colborne and Lo, 2019) have focused on the use of monolingual word embeddings. Parcheta et al (2019) use a machine translation system trained on clean data to translate the source sentences of the noisy corpus and evaluate the translation against the original target sentences using BLEU scores. Erdmann and Gwinnup (2019) and Sen et al (2019) propose similar methods using METEOR scores and Levenshtein distance respectively.…”

Section: Related Work (mentioning)

confidence: 99%

Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora

Kumar, Koehn, Khudanpur 2021

Preprint

Large web-crawled corpora represent an excellent resource for improving the performance of Neural Machine Translation (NMT) systems across several language pairs. However, since these corpora are typically extremely noisy, their use is fairly limited. Current approaches to dealing with this problem mainly focus on filtering using heuristics or single features such as language model scores or bilingual similarity. This work presents an alternative approach which learns weights for multiple sentence-level features. These feature weights, which are optimized directly for the task of improving translation performance, are used to score and filter sentences in the noisy corpora more effectively. We provide results of applying this technique to building NMT systems using the Paracrawl corpus for Estonian-English and show that it beats strong single-feature baselines and hand-designed combinations. Additionally, we analyze the sensitivity of this method to different types of noise and explore if the learned weights generalize to other language pairs using the Maltese-English Paracrawl corpus.
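As a rough illustration of scoring with weighted sentence-level features, the sketch below combines per-pair feature values into one score and ranks the pairs. It assumes a plain linear combination and hand-picked weights; the abstract does not spell out the combination form, and the reward-modeling procedure that learns the weights is not reproduced here. The feature names follow the examples in the abstract, while `combined_score`, `rank_pairs` and all numeric values are hypothetical.

```python
from typing import Dict, List


def combined_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values (assumption: linear combination);
    a feature missing from a row counts as 0.0."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


def rank_pairs(rows: List[Dict[str, float]], weights: Dict[str, float]) -> List[int]:
    """Return row indices ordered from highest to lowest combined score."""
    return sorted(
        range(len(rows)),
        key=lambda i: combined_score(rows[i], weights),
        reverse=True,
    )


# Hypothetical weights; in the cited work these would be learned, not set by hand.
weights = {"lm_score": 0.3, "bilingual_similarity": 0.7}

# Hypothetical, already-normalised feature values for two sentence pairs.
rows = [
    {"lm_score": 0.9, "bilingual_similarity": 0.1},  # fluent but not parallel
    {"lm_score": 0.2, "bilingual_similarity": 0.8},  # less fluent but parallel
]

order = rank_pairs(rows, weights)
```

With these toy weights the second pair outranks the first, since bilingual similarity is weighted more heavily than the language model score.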
