Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
DOI: 10.18653/v1/2020.acl-main.756

Parallel Corpus Filtering via Pre-trained Language Models

Abstract: Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains.
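The abstract does not show the exact parallelism score, so the following is only a minimal sketch of the general idea: embed both sides of a sentence pair with multilingual BERT and treat their similarity as a filtering signal. Mean pooling, cosine similarity, and the 0.8 threshold are illustrative assumptions, not the paper's actual acceptability model.

```python
# Sketch: score sentence-pair parallelism with multilingual BERT embeddings.
# Pooling strategy and threshold are assumptions for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden layer over non-padding tokens."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)

def parallelism_score(src: str, tgt: str) -> float:
    """Cosine similarity of source/target embeddings as a rough parallelism proxy."""
    return torch.nn.functional.cosine_similarity(embed(src), embed(tgt)).item()

pairs = [("Das ist ein Test.", "This is a test."),
         ("Das ist ein Test.", "Completely unrelated sentence about cats.")]
# Keep only pairs whose score clears an (assumed) threshold.
filtered = [p for p in pairs if parallelism_score(*p) > 0.8]
```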

Cited by 17 publications (6 citation statements)
References 16 publications
“…Web crawling has become a popular method for corpus acquisition, though the resulting corpus often contains considerable noise and impurities. Zhang et al. [41] utilized the multilingual capabilities of BERT for sentence alignment and employed the Generative Pre-Training (GPT) language model as a domain filter to achieve data domain balance. Cao et al.'s [42] analysis of BERT identified systematic issues, such as misalignments in open-class lexemes and word pairs across different character sets, which were corrected through a series of alignment procedures.…”
Section: Text Alignment
confidence: 99%
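The GPT domain filter mentioned in the quote above is commonly realized as a perplexity check against an in-domain language model. The sketch below assumes GPT-2 as the language model and a fixed perplexity cutoff as the selection rule; neither choice is specified by the citing text.

```python
# Sketch: use a causal LM's perplexity as a domain filter, keeping
# sentences the in-domain LM finds fluent. GPT-2 and the cutoff value
# are illustrative assumptions.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """exp of the mean token-level negative log-likelihood under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over predicted tokens
    return math.exp(loss.item())

corpus = ["The patient was given 5 mg of the drug.",
          "asdf qwerty zxcv lorem"]
in_domain = [s for s in corpus if perplexity(s) < 200.0]  # assumed cutoff
```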
“…Data selection and data filter methods have been widely used in NMT. To balance data domains or enhance the data quality generated by back-translation (Sennrich et al, 2016b), many approaches have been proposed, such as utilizing language models (Moore and Lewis, 2010;van der Wees et al, 2017;Zhang et al, 2020), translation models (Junczys-Dowmunt, 2018;Wang et al, 2019a), and curriculum learning (Zhang et al, 2019b;Wang et al, 2019b). Different from the above methods, our MSO dynamically combines language models with translation models for data selection during training, making full use of the models.…”
Section: Related Work
confidence: 99%
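The language-model-based data selection cited above (Moore and Lewis, 2010) ranks sentences by the cross-entropy difference between an in-domain LM and a general-domain LM. The sketch below uses add-one-smoothed unigram models purely as a simplification; the toy corpora and threshold-free ranking are assumptions for illustration.

```python
# Sketch of Moore-Lewis cross-entropy-difference data selection:
# score(s) = H_in_domain(s) - H_general(s); lower scores are more in-domain.
import math
from collections import Counter

def train_unigram(corpus):
    """Add-one-smoothed unigram probability function over whitespace tokens."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve mass for unseen tokens
    return lambda tok: (counts.get(tok, 0) + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    toks = sentence.split()
    return -sum(math.log2(lm(t)) for t in toks) / max(len(toks), 1)

in_domain = ["the treaty was signed in geneva", "the delegates approved the treaty"]
general = ["the cat sat on the mat", "buy cheap watches online now"]
lm_in, lm_gen = train_unigram(in_domain), train_unigram(general)

candidates = ["the treaty enters into force", "cheap watches for sale"]
# Most in-domain-like sentences come first.
ranked = sorted(candidates,
                key=lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_gen, s))
```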
“…In contrast, reference-free evaluation is most naturally applied at the system (test-set) level, and ideally should make no assumptions about the systems under evaluation. The second task is parallel-corpus mining (Zhang et al., 2020), which aims to identify valid translations at various levels of granularity. Its scoring aspect is similar to reference-free evaluation, but it is applied to a different input distribution, attempting to identify human-generated translation pairs rather than scoring MT outputs for a given human-generated source text.…”
Section: Related Work
confidence: 99%