Abstract: Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT…
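The filtering idea in the abstract above — score each sentence pair by cross-lingual similarity under a pre-trained encoder and drop low-scoring pairs — can be sketched as follows. This is a minimal illustration, not the paper's method: `toy_embed` is a stand-in for a real multilingual encoder (e.g. mean-pooled BERT states), and the threshold value is arbitrary.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_parallel_pairs(pairs, embed, threshold=0.7):
    """Keep sentence pairs whose embeddings are similar enough.

    `embed` maps a sentence to a fixed-size vector; in a real system it
    would be a multilingual encoder, injected here as a dependency.
    """
    return [(src, tgt) for src, tgt in pairs
            if cosine(embed(src), embed(tgt)) >= threshold]

# Toy stand-in embedding: character-frequency vector over a-z.
def toy_embed(sentence):
    s = sentence.lower()
    return [s.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

pairs = [("hello world", "hello world"), ("hello world", "zzzzzz")]
print(filter_parallel_pairs(pairs, toy_embed, threshold=0.9))
# → [('hello world', 'hello world')]
```

The key design point is that the parallelism score is decoupled from the scoring model, so the same filter works with any encoder that yields comparable vectors for both languages.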
“…Web crawling has become a popular method for corpus acquisition, though the resulting corpus often contains considerable noise and impurities. Zhang et al. [41] utilized the multilingual capabilities of BERT for sentence alignment and employed the Generative Pre-Training (GPT) language model as a domain filter to achieve data domain balance. Cao et al.'s [42] analysis of BERT identified systematic issues, such as misalignments in open-class lexemes and word pairs across different character sets, which were corrected through a series of alignment procedures.…”
This research introduces WCC-EC 2.0 (Web-Crawled Corpus—English and Chinese), a comprehensive parallel corpus designed for enhancing Neural Machine Translation (NMT), featuring over 1.6 million English-Chinese sentence pairs meticulously gathered via web crawling. This corpus, extracted through an advanced web crawler, showcases the vast linguistic diversity and richness of English and Chinese, uniquely spanning the rarely covered news and music domains. Our methodical approach to web crawling and corpus assembly, coupled with rigorous experiments and manual evaluations, demonstrated its superiority by achieving high BLEU scores, marking significant strides in translation accuracy and model resilience. The inclusion of these rarely covered domains adds significant value, providing a unique dataset that enriches the scope for NMT research and development. With the rise of NMT technology, WCC-EC 2.0 emerges not only as an invaluable resource for researchers and developers, but also as a pivotal tool for improving translation accuracy, training more resilient models, and promoting interlingual communication.
“…Data selection and data filtering methods have been widely used in NMT. To balance data domains or enhance the quality of data generated by back-translation (Sennrich et al., 2016b), many approaches have been proposed, such as utilizing language models (Moore and Lewis, 2010; van der Wees et al., 2017; Zhang et al., 2020), translation models (Junczys-Dowmunt, 2018; Wang et al., 2019a), and curriculum learning (Zhang et al., 2019b; Wang et al., 2019b). Different from the above methods, our MSO dynamically combines language models with translation models for data selection during training, making full use of both models.…”
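One of the language-model-based selection techniques cited above (Moore and Lewis, 2010) scores a sentence by the difference between its cross-entropy under an in-domain LM and under a general LM. A minimal sketch under simplifying assumptions: the two LMs are passed in as per-token log-probability functions (real LMs condition on context), and the toy unigram models below are illustrative stand-ins.

```python
import math

def cross_entropy_diff(in_domain_lp, general_lp, tokens):
    """Moore-Lewis style selection score: per-token cross-entropy under
    an in-domain LM minus that under a general LM. Lower scores mark
    sentences that look in-domain; keep the lowest-scoring ones."""
    n = max(len(tokens), 1)
    h_in = -sum(in_domain_lp(t) for t in tokens) / n
    h_gen = -sum(general_lp(t) for t in tokens) / n
    return h_in - h_gen

# Toy unigram "LMs": the in-domain model favors the word "translation".
in_lp = lambda t: math.log(0.5 if t == "translation" else 0.1)
gen_lp = lambda t: math.log(0.1)

# An in-domain-looking sentence scores lower than an out-of-domain one.
print(cross_entropy_diff(in_lp, gen_lp, ["translation"]) <
      cross_entropy_diff(in_lp, gen_lp, ["weather"]))  # → True
```

Subtracting the general-LM cross-entropy normalizes away tokens that every model finds easy, so the score isolates genuine domain affinity rather than overall frequency.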
The Neural Machine Translation (NMT) model is essentially a joint language model conditioned on both the source sentence and the partial translation. Therefore, the NMT model naturally involves the mechanism of a Language Model (LM) that predicts the next token based only on the partial translation. Despite its success, NMT still suffers from the hallucination problem, generating fluent but inadequate translations. The main reason is that NMT pays excessive attention to the partial translation while neglecting the source sentence to some extent, namely overconfidence of the LM. Accordingly, we define the Margin between the NMT model and the LM, calculated by subtracting the predicted probability of the LM from that of the NMT model for each token. The Margin is negatively correlated with the overconfidence degree of the LM. Based on this property, we propose a Margin-based Token-level Objective (MTO) and a Margin-based Sentence-level Objective (MSO) to maximize the Margin and prevent the LM from becoming overconfident. Experiments on WMT14 English-to-German, WMT19 Chinese-to-English, and WMT14 English-to-French translation tasks demonstrate the effectiveness of our approach, with 1.36, 1.50, and 0.63 BLEU improvements, respectively, over the Transformer baseline. Human evaluation further verifies that our approaches improve translation adequacy as well as fluency.
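The Margin defined in the abstract above is a per-token quantity: the NMT model's probability minus the LM's probability for the same target token. The sketch below illustrates that definition and one plausible token-level penalty built on it; the exact weighting in the paper's MTO may differ, so treat `mto_loss` as an assumption-laden illustration rather than the published objective.

```python
import math

def token_margins(nmt_probs, lm_probs):
    """Per-token Margin: NMT probability minus LM probability.

    A small or negative margin signals that the LM is confident relative
    to the source-conditioned NMT prediction (LM overconfidence)."""
    return [p_nmt - p_lm for p_nmt, p_lm in zip(nmt_probs, lm_probs)]

def mto_loss(nmt_probs, lm_probs):
    """Illustrative margin-based token-level objective: standard NLL plus
    a term that rewards larger margins, pushing the NMT model to beat
    the unconditional LM on every target token."""
    total = 0.0
    for p_nmt, margin in zip(nmt_probs, token_margins(nmt_probs, lm_probs)):
        total += -math.log(p_nmt) - margin  # maximize log-prob AND margin
    return total / len(nmt_probs)

margins = token_margins([0.6, 0.5], [0.2, 0.7])
# ≈ [0.4, -0.2]: the second token is one the LM alone already favors.
```

Because the margin enters the loss with a negative sign, gradient descent raises the NMT probability relative to the LM's exactly on the tokens where the LM is overconfident.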
“…In contrast, reference-free evaluation is most naturally applied at the system (test-set) level, and ideally should make no assumptions about the systems under evaluation. The second task is parallel-corpus mining (Zhang et al., 2020), which aims to identify valid translations at various levels of granularity. Its scoring aspect is similar to reference-free evaluation, but it is applied to a different input distribution, attempting to identify human-generated translation pairs rather than scoring MT outputs for a given human-generated source text.…”
Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has recently been shown that the probabilities given by a large, multilingual model can achieve state-of-the-art results when used as a reference-free metric. We experiment with various modifications to this model, and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach, and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities.
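The core mechanic described above — scoring a hypothesis by the probability a multilingual model assigns to it given only the source, with no reference translation — can be sketched as follows. The model itself is abstracted as a `logprob_fn`, and `stub_logprobs` is a deterministic stand-in used purely for demonstration, not a real scoring model.

```python
import math

def lm_score(logprob_fn, source, hypothesis):
    """Reference-free sentence score: length-normalized log-probability
    the model assigns to the hypothesis given the source."""
    token_logprobs = logprob_fn(source, hypothesis)
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def system_score(logprob_fn, pairs):
    # System-level (test-set) score: average over sentence scores.
    return sum(lm_score(logprob_fn, s, h) for s, h in pairs) / len(pairs)

def stub_logprobs(source, hypothesis):
    # Stand-in for a multilingual LM: reward word overlap with the source.
    src = set(source.split())
    return [math.log(0.9) if w in src else math.log(0.1)
            for w in hypothesis.split()]

good = system_score(stub_logprobs, [("a b c", "a b c")])
bad = system_score(stub_logprobs, [("a b c", "x y z")])
print(good > bad)  # → True
```

Length normalization matters here: without it, the metric would systematically favor shorter hypotheses regardless of adequacy.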