Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.417
ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Abstract: We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
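The sentence pair filtering the abstract mentions can be illustrated with a minimal, hypothetical sketch: a rule-based filter of the kind benchmarked in the paper. This is not the ParaCrawl pipeline itself; the heuristics (token-length bounds, length ratio, identity check) and their thresholds are illustrative assumptions only.

```python
# Illustrative sketch only (not the actual ParaCrawl pipeline): rule-based
# sentence-pair filtering with simple, hypothetical heuristics.

def keep_pair(src: str, tgt: str,
              min_tokens: int = 3,
              max_tokens: int = 80,
              max_len_ratio: float = 2.0) -> bool:
    """Return True if a sentence pair passes basic quality heuristics."""
    src_tok, tgt_tok = src.split(), tgt.split()
    # Discard pairs that are too short or too long to be useful.
    if not (min_tokens <= len(src_tok) <= max_tokens):
        return False
    if not (min_tokens <= len(tgt_tok) <= max_tokens):
        return False
    # Discard implausible length ratios (likely misalignments).
    ratio = max(len(src_tok), len(tgt_tok)) / min(len(src_tok), len(tgt_tok))
    if ratio > max_len_ratio:
        return False
    # Discard untranslated (identical) pairs.
    if src.strip().lower() == tgt.strip().lower():
        return False
    return True

pairs = [
    ("The house is red.", "La maison est rouge."),
    ("Click here", "Cliquez ici pour en savoir plus sur nos offres"),
    ("Hello world", "Hello world"),
]
kept = [p for p in pairs if keep_pair(*p)]
```

In practice such rule-based filters are typically combined with learned scoring (as the paper's filtering benchmarks compare); the thresholds above are placeholders, not values from the paper.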

Cited by 96 publications (101 citation statements) | References 44 publications
“…The harvesting of parallel data from the web has been shown successfully by [4,18], resulting in highly heterogeneous collected data, as sampled from the entire web. Thus, the distribution of the content is inevitably dominated by the commercial websites working in a multi-language setting.…”
Section: Introduction
confidence: 99%
“…We build our PyTorch implementation on top of HuggingFace's Transformers library (Wolf et al, 2020). Training data is composed of the ParaCrawl 8 (Bañón et al, 2020) v5.0 datasets for each language pair. We experiment on English-French, English-German, English-Spanish and English-Italian.…”
Section: Methods
confidence: 99%
“…They used a German-French test set and achieved state-of-the-art results. Their method also works effectively for low- and medium-resource language pairs with the Bible dataset and was used for building the ParaCrawl corpus, one of the largest parallel corpora, pairing English with 23 EU languages, obtained by crawling the web (Bañón et al, 2020).…”
Section: Related Work
confidence: 99%