Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.417
ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Abstract: We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
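The sentence pair filtering the abstract mentions can be illustrated with a minimal, hypothetical sketch: a rule-based filter of the kind benchmarked in the paper. This is not the ParaCrawl pipeline itself; the heuristics (token-length bounds, length ratio, identity check) and their thresholds are illustrative assumptions only.

```python
# Illustrative sketch only (not the actual ParaCrawl pipeline): rule-based
# sentence-pair filtering with simple, hypothetical heuristics.

def keep_pair(src: str, tgt: str,
              min_tokens: int = 3,
              max_tokens: int = 80,
              max_len_ratio: float = 2.0) -> bool:
    """Return True if a sentence pair passes basic quality heuristics."""
    src_tok, tgt_tok = src.split(), tgt.split()
    # Discard pairs that are too short or too long to be useful.
    if not (min_tokens <= len(src_tok) <= max_tokens):
        return False
    if not (min_tokens <= len(tgt_tok) <= max_tokens):
        return False
    # Discard implausible length ratios (likely misalignments).
    ratio = max(len(src_tok), len(tgt_tok)) / min(len(src_tok), len(tgt_tok))
    if ratio > max_len_ratio:
        return False
    # Discard untranslated (identical) pairs.
    if src.strip().lower() == tgt.strip().lower():
        return False
    return True

pairs = [
    ("The house is red.", "La maison est rouge."),
    ("Click here", "Cliquez ici pour en savoir plus sur nos offres"),
    ("Hello world", "Hello world"),
]
kept = [p for p in pairs if keep_pair(*p)]
```

In practice such rule-based filters are typically combined with learned scoring (as the paper's filtering benchmarks compare); the thresholds above are placeholders, not values from the paper.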

Cited by 96 publications (101 citation statements) | References 44 publications
“…The harvesting of parallel data from the web has been shown successfully by [4,18], resulting in highly heterogeneous collected data, as sampled from the entire web. Thus, the distribution of the content is inevitably dominated by the commercial websites working in a multi-language setting.…”
Section: Introduction
confidence: 99%
“…We build our PyTorch implementation on top of HuggingFace's Transformers library (Wolf et al, 2020). Training data is composed of the ParaCrawl 8 (Bañón et al, 2020) v5.0 datasets for each language pair. We experiment on English-French, English-German, English-Spanish and English-Italian.…”
Section: Methods
confidence: 99%
“…They used a German-French test set and achieved state-of-the-art results. Their method also works effectively for low- and medium-resource language pairs with the Bible dataset and was used for building the ParaCrawl corpus, one of the largest parallel corpora, pairing English with 23 EU languages, obtained by crawling the web (Bañón et al, 2020).…”
Section: Related Work
confidence: 99%