Mining a comparable text corpus for a Vietnamese - French statistical machine translation system

Diep, Thi Ngoc; Le, Viet Bac; Bigi, Brigitte; Besacier, Laurent; Castelli, Eric

doi:10.3115/1626431.1626466

Cited by 13 publications

(6 citation statements)

References 13 publications


“…Another direction, however, is to identify bitexts using only textual information, as the metadata associated with documents can often be sparse or unreliable (Uszkoreit et al, 2010). Some text-based approaches for identifying bitexts rely on methods such as n-gram scoring (Uszkoreit et al, 2010), named entity matching (Do et al, 2009), and cross-language information retrieval (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005).…”

Section: Related Work (mentioning)

confidence: 99%

Effective Parallel Corpus Mining using Bilingual Sentence Embeddings

Guo, Shen, Yang et al. 2018

Proceedings of the Third Conference on Machine Translation: Research Papers


This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. This is achieved using a novel training method that introduces hard negatives consisting of sentences that are not translations but that have some degree of semantic similarity. The quality of the resulting embeddings are evaluated on parallel corpus reconstruction and by assessing machine translation systems trained on gold vs. mined sentence pairs. We find that the sentence embeddings can be used to reconstruct the United Nations Parallel Corpus (Ziemski et al., 2016) at the sentence level with a precision of 48.9% for en-fr and 54.9% for en-es. When adapted to document level matching, we achieve a parallel document matching accuracy that is comparable to the significantly more computationally intensive approach of Uszkoreit et al. (2010). Using reconstructed parallel data, we are able to train NMT models that perform nearly as well as models trained on the original data (within 1-2 BLEU).
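As a rough illustration of how sentence embeddings can drive mining, the sketch below pairs each source sentence with its most similar target sentence by cosine similarity and keeps the pair only above a threshold. It is not the authors' implementation: the function names, the toy vectors, and the 0.8 threshold are assumptions made for the example, and real embeddings would come from a trained bilingual encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for each source sentence whose
    best-scoring target sentence reaches the (assumed) similarity threshold."""
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy embeddings standing in for encoder output (hypothetical values).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 0.0]]
mined = mine_pairs(src, tgt)
```

In this toy run the third source vector is equally far from both targets and falls below the threshold, so it is left unpaired.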


“…Other approaches have identified parallel documents in unstructured web corpora by relying on metadata (Nie et al, 1999; Espla-Gomis and Forcada, 2010). Some of these methods have focused on publication date and other temporal heuristics to aid in identifying parallel documents (Munteanu and Marcu, 2005, 2006; Udupa et al, 2009; Do et al, 2009; Abdul-Rauf and Schwenk, 2009). However, temporal features can be sparse, noisy, and unreliable.…”

Section: Related Work (mentioning)

confidence: 99%

CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

El-Kishky, Chaudhary, Guzmán et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)


Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of over 392 million URL pairs from Common Crawl covering documents in 8144 language pairs of which 137 pairs include English. In addition to curating this massive dataset, we introduce baseline methods that leverage crosslingual representations to identify aligned documents based on their textual content. Finally, we demonstrate the value of this parallel documents dataset through a downstream task of mining parallel sentences and measuring the quality of machine translations from models trained on this mined data. Our objective in releasing this dataset is to foster new research in cross-lingual NLP across a variety of low, medium, and high-resource languages.
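The idea of using URL signals can be sketched as follows: drop a language code from each URL path and treat URLs that collapse to the same normalized form as candidate document pairs. This is a simplified assumption about how such matching might work, not the released pipeline; the `LANG_CODES` set, the function names, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny set of language codes for the example.
LANG_CODES = {"en", "fr", "vi", "es"}

def strip_language(url):
    """Return (normalized_url, language_code); language_code is None
    when no path segment is a known language code."""
    parts = urlsplit(url)
    found = None
    kept = []
    for segment in parts.path.split("/"):
        if found is None and segment.lower() in LANG_CODES:
            found = segment.lower()  # remove only the first language segment
        else:
            kept.append(segment)
    return parts.netloc + "/".join(kept), found

def candidate_pairs(urls):
    """Group URLs by normalized form and pair those in different languages."""
    buckets = defaultdict(dict)
    for url in urls:
        key, lang = strip_language(url)
        if lang is not None:
            buckets[key].setdefault(lang, url)
    pairs = []
    for by_lang in buckets.values():
        langs = sorted(by_lang)
        for i, first in enumerate(langs):
            for second in langs[i + 1:]:
                pairs.append((by_lang[first], by_lang[second]))
    return pairs

urls = [
    "https://example.org/en/news/1",
    "https://example.org/fr/news/1",
    "https://example.org/vi/about",
]
pairs = candidate_pairs(urls)
```

Matching on URLs alone is cheap but only proposes candidates; the abstract notes that content-based baselines are also used to identify aligned documents.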


“…Other approaches have identified parallel documents in unstructured web corpora by relying on metadata. Some of these methods have focused on publication date and other temporal heuristics to aid in identifying parallel documents [1,10,28,29,40]. However, temporal features are often sparse, noisy, and unreliable.…”

Section: Related Work (mentioning)

confidence: 99%

Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance

El-Kishky, Guzmán 2020

Preprint


Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Such aligned data can be used for a variety of NLP tasks from training cross-lingual representations to mining parallel bitexts for machine translation training. In this paper we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages. These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs. Recognizing that our proposed scoring function and other state of the art methods are computationally intractable for long web documents, we utilize a more tractable greedy algorithm that performs comparably. We experimentally demonstrate that our distance metric performs better alignment than current baselines outperforming them by 7% on high-resource language pairs, 15% on mid-resource language pairs, and 22% on low-resource language pairs.
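One common way to approximate a mover's distance greedily is to visit sentence pairs from closest to farthest and move as much remaining mass as possible at each step. The sketch below follows that scheme under stated assumptions: uniform sentence weights, Euclidean distance, and toy vectors in place of real cross-lingual sentence embeddings. It is an illustration of the general technique and may differ from the algorithm in the paper.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def greedy_movers_distance(doc_a, doc_b):
    """Greedy approximation of a mover's distance between two documents,
    each given as a non-empty list of sentence embedding vectors.
    Assumption: every sentence in a document carries equal mass."""
    mass_a = [1.0 / len(doc_a)] * len(doc_a)
    mass_b = [1.0 / len(doc_b)] * len(doc_b)
    # All cross-document sentence pairs, closest first.
    pairs = sorted(
        (euclidean(u, v), i, j)
        for i, u in enumerate(doc_a)
        for j, v in enumerate(doc_b)
    )
    total = 0.0
    for dist, i, j in pairs:
        flow = min(mass_a[i], mass_b[j])  # move as much mass as both sides allow
        if flow > 0:
            total += flow * dist
            mass_a[i] -= flow
            mass_b[j] -= flow
    return total

# Toy documents (hypothetical embeddings).
same = greedy_movers_distance([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]])
far = greedy_movers_distance([[0.0, 0.0]], [[3.0, 4.0]])
```

The greedy pass costs a sort over all sentence pairs rather than solving a full transport problem, which is the kind of tractability trade-off the abstract describes.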
