Proceedings of the Third Conference on Machine Translation: Shared Task Papers 2018
DOI: 10.18653/v1/w18-6488

Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task

Abstract: This paper describes Prompsit Language Engineering's submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual translations. A set of hand-crafted hard rules for discarding sentences with evident flaws was applied before the classifier. We explored different strategies for achieving a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-inspired …
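The hard-rule stage the abstract describes can be pictured with a minimal sketch like the one below. This is not the authors' implementation; the specific rules and thresholds (length limit, length ratio, identical sides, alphabetic-character share) are illustrative assumptions about the kind of "evident flaws" such rules target.

```python
# Hypothetical hard-rule filter of the kind the paper describes: cheap
# checks that discard obviously flawed sentence pairs before the
# classifier runs. Thresholds are illustrative, not the authors' values.

def passes_hard_rules(src: str, tgt: str,
                      max_len: int = 200,
                      max_ratio: float = 2.0) -> bool:
    """Return True if the sentence pair survives all hard rules."""
    src_toks, tgt_toks = src.split(), tgt.split()

    # Rule 1: neither side may be empty or absurdly long.
    if not src_toks or not tgt_toks:
        return False
    if len(src_toks) > max_len or len(tgt_toks) > max_len:
        return False

    # Rule 2: the source/target length ratio must be plausible.
    ratio = len(src_toks) / len(tgt_toks)
    if not (1.0 / max_ratio <= ratio <= max_ratio):
        return False

    # Rule 3: identical sides usually mean untranslated text.
    if src.strip().lower() == tgt.strip().lower():
        return False

    # Rule 4: mostly non-alphabetic sentences are likely boilerplate.
    if sum(c.isalpha() for c in src) / max(len(src), 1) < 0.5:
        return False

    return True


pairs = [("Hello world .", "Hola mundo ."),
         ("Hello world .", "Hello world .")]   # second pair fails rule 3
kept = [p for p in pairs if passes_hard_rules(*p)]
```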

Cited by 29 publications (11 citation statements). References 11 publications.
“…In the case of EN-FR, we observed that removing misalignments from a web-scraped corpus did not result in an increase in NMT performance; on the contrary, the decrease in data size outweighed the increased alignment quality. This result differs from previous work on misalignment detection and data cleaning in an NMT context, e.g., [4,11,14,20]. However, we noted that the web-scraped EN-FR corpus used for extrinsic evaluation was much cleaner in terms of misalignments and other noise than the corpora used in previous work, such as the OpenSubtitles and ParaCrawl corpora: the amount and degree of misalignment (in terms of MAD score) present in the web-scraped corpus were probably too low to harm NMT performance, as is clear from the amount of data labeled as aligned by MAD (76% of the sentence pairs).…”
Section: Discussion (contrasting)
confidence: 93%
“…Finally, sentences are scored based on fluency and diversity. More details are provided in Reference [20].…”
Section: Bicleaner (mentioning)
confidence: 99%
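The fluency-and-diversity scoring that this citation attributes to the pipeline can be approximated with a toy sketch. Two loud assumptions: fluency is proxied here by a smoothed unigram log-probability per word (a real system would use a proper language model), and diversity by whether a candidate sentence still contributes unseen trigrams; neither choice nor any threshold is taken from the paper.

```python
# Toy fluency/diversity selection sketch (assumptions noted above).
import math
from collections import Counter

def unigram_lm(corpus):
    """Build an add-one-smoothed unigram scorer from monolingual text."""
    counts = Counter(w for s in corpus for w in s.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    def logprob_per_word(sentence):
        words = sentence.split()
        lp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
        return lp / max(len(words), 1)   # length-normalised fluency proxy
    return logprob_per_word

def trigrams(sentence):
    toks = sentence.split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def select_diverse(candidates, fluency, k=2):
    """Greedily keep fluent sentences that still add unseen trigrams."""
    seen, chosen = set(), []
    for sent in sorted(candidates, key=fluency, reverse=True):
        if trigrams(sent) - seen or not seen:
            chosen.append(sent)
            seen |= trigrams(sent)
        if len(chosen) == k:
            break
    return chosen

fluency = unigram_lm(["the cat sat on the mat", "the dog sat on the mat"])
picked = select_diverse(["the cat sat on the mat",
                         "the cat sat on the mat",   # duplicate adds nothing
                         "the dog sat on the mat"], fluency)
```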
“…One way to perform filtering is to keep only sentences with a better per-word cross-entropy than a certain threshold. Another way is to use Bicleaner, an off-the-shelf tool that scores similarity at the sentence-pair level (Sánchez-Cartagena et al., 2018). Filtering is optional for post-expansion pruning.…”
Section: Filtering (mentioning)
confidence: 99%
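The cross-entropy filter this quote mentions reduces to a one-line threshold test. The sketch below assumes a pluggable `lm_logprob` callable returning a sentence's total natural-log probability under some language model; the threshold value is a placeholder, not one taken from the cited work.

```python
# Per-word cross-entropy filtering sketch; `lm_logprob` and the
# threshold are stand-in assumptions, not values from the cited work.

def per_word_cross_entropy(sentence, lm_logprob):
    """Negative average log-probability per word (lower = more fluent)."""
    return -lm_logprob(sentence) / max(len(sentence.split()), 1)

def filter_by_cross_entropy(sentences, lm_logprob, threshold=6.0):
    """Keep only sentences whose per-word cross-entropy beats the threshold."""
    return [s for s in sentences
            if per_word_cross_entropy(s, lm_logprob) < threshold]

toy_lm = lambda s: -2.0 * len(s.split())   # pretend LM: -2 nats per word
clean = filter_by_cross_entropy(["a b c", "a b"], toy_lm, threshold=3.0)
```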
“…When processing bilingual corpora, any meaning mismatches between the two languages are primarily viewed as noise for the downstream task. In shared tasks for filtering web-crawled parallel corpora (Koehn et al., 2018), the best performing systems rely on translation models or cross-lingual sentence embeddings to place bilingual sentences on a clean-to-noisy scale (Junczys-Dowmunt, 2018; Sánchez-Cartagena et al., 2018; Lu et al., 2018). When mining parallel segments in Wikipedia for the WikiMatrix corpus, examples are ranked using the LASER score (Artetxe and Schwenk, 2019), which computes cross-lingual similarity in a language-agnostic sentence embedding space.…”
Section: Introduction (mentioning)
confidence: 99%
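The embedding-based ranking this quote describes can be sketched as follows, with plain cosine similarity standing in as a simplified proxy for the margin-based LASER score; `encode` is a placeholder for a real language-agnostic encoder and returns random vectors here only to keep the example self-contained.

```python
# Cosine-similarity ranking sketch; a simplified proxy for LASER-style
# scoring, with a dummy encoder so the example runs stand-alone.
import numpy as np

rng = np.random.default_rng(0)

def encode(sentence: str) -> np.ndarray:
    # Placeholder: a real system would map both languages into one
    # language-agnostic embedding space (e.g. via the LASER toolkit).
    return rng.standard_normal(16)

def similarity(src: str, tgt: str) -> float:
    u, v = encode(src), encode(tgt)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_pairs(pairs):
    """Sort candidate pairs from most to least similar (clean to noisy)."""
    return sorted(pairs, key=lambda p: similarity(*p), reverse=True)

ranked = rank_pairs([("Hello world", "Hola mundo"),
                     ("Hello world", "totally unrelated text")])
```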