Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.475

Dynamic Data Selection and Weighting for Iterative Back-Translation

Abstract: Back-translation has proven to be an effective method to utilize monolingual data in neural machine translation (NMT), and iteratively conducting back-translation can further improve the model performance. Selecting which monolingual data to back-translate is crucial, as we require that the resulting synthetic data are of high quality and reflect the target domain. To achieve these two goals, data selection and weighting strategies have been proposed, with a common practice being to select samples close to the…
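The abstract's notion of selecting monolingual samples "close to the target domain" can be illustrated with a classic cross-entropy-difference heuristic (in the spirit of Moore-Lewis scoring). This is a minimal sketch, not the paper's actual method; the unigram language models and function names below are illustrative stand-ins.

```python
# Hypothetical sketch: rank monolingual sentences for back-translation by how
# in-domain they look, using smoothed unigram LMs and cross-entropy difference.
import math
from collections import Counter

def unigram_logprobs(corpus):
    """Add-one-smoothed unigram log-probabilities over a list of sentences."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for the unknown token
    logprobs = {t: math.log((c + 1) / (total + vocab)) for t, c in counts.items()}
    unk_lp = math.log(1 / (total + vocab))
    return logprobs, unk_lp

def cross_entropy(sentence, logprobs, unk_lp):
    """Per-token negative log-likelihood of a sentence under the unigram LM."""
    toks = sentence.split()
    return -sum(logprobs.get(t, unk_lp) for t in toks) / max(len(toks), 1)

def select_for_backtranslation(monolingual, in_domain, general, k):
    """Keep the k sentences with the largest H_general(s) - H_in_domain(s),
    i.e. those that look most like the target domain."""
    in_lp, in_unk = unigram_logprobs(in_domain)
    gen_lp, gen_unk = unigram_logprobs(general)
    scored = sorted(
        monolingual,
        key=lambda s: cross_entropy(s, gen_lp, gen_unk)
                      - cross_entropy(s, in_lp, in_unk),
        reverse=True,
    )
    return scored[:k]
```

In practice one would use stronger (e.g. neural) language models, but the design choice is the same: score each candidate by how much better the in-domain model explains it than a general-domain model, then back-translate the top-scoring samples.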

Cited by 37 publications (31 citation statements). References 29 publications (23 reference statements).
“…Some work explores data symmetry (Freitag and Firat, 2020; Birch et al., 2008; Lin et al., 2019). Zero-shot translation in severely low-resource settings exploits massive multilinguality, cross-lingual transfer, pretraining, iterative back-translation, and freezing subnetworks (Lauscher et al., 2020; Nooralahzadeh et al., 2020; Pfeiffer et al., 2020; Baziotis et al., 2020; Chronopoulou et al., 2020; Lin et al., 2020; Thompson et al., 2018; Luong et al., 2014; Dou et al., 2020).…”
Section: Machine Polyglotism and Pretraining (mentioning)
confidence: 99%
“…However, there are two issues in the train/dev/test splits used in . First, Ma et al. (2019) and Dou et al. (2020) find that some identical sentence pairs appear in both the training and test data. Second, randomly shuffling the bi-text data and splitting it into halves may introduce more overlap than exists in natural monolingual data, i.e., bilingual sentences from the same document are likely to be selected into the monolingual data (e.g., one sentence in the source split and its translation in the target split).…”
Section: Setup (mentioning)
confidence: 99%
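The two split issues quoted above (verbatim train/test overlap, and source/target leakage when bi-text is shuffled into monolingual halves) can both be detected with simple set checks. This is an illustrative sketch under assumed data shapes (lists of sentence-pair tuples), not tooling from the paper.

```python
# Illustrative checks for the two data-split issues described in the excerpt.

def pair_overlap(train_pairs, test_pairs):
    """Return test sentence pairs that also appear verbatim in the training data
    (issue 1: identical pairs shared between train and test)."""
    train_set = set(train_pairs)
    return [p for p in test_pairs if p in train_set]

def translation_leakage(src_half, tgt_half, bitext):
    """Count bitext pairs whose source sentence landed in one monolingual half
    while its reference translation landed in the other (issue 2: leakage from
    randomly splitting bi-text into 'monolingual' halves)."""
    src_set, tgt_set = set(src_half), set(tgt_half)
    return sum(1 for s, t in bitext if s in src_set and t in tgt_set)
```

Running such checks before constructing monolingual splits would surface exactly the overlap the citing authors report; a document-level (rather than sentence-level) split avoids the leakage in the second case.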