2020
DOI: 10.48550/arxiv.2010.03142
Preprint

Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information

Abstract: We investigate the following question for machine translation (MT): can we develop a single universal MT model to serve as the common seed and obtain derivative and improved models on arbitrary language pairs? We propose mRASP, an approach to pre-train a universal multilingual neural machine translation model. Our key idea in mRASP is its novel technique of random aligned substitution, which brings words and phrases with similar meanings across multiple languages closer in the representation space. We pre-trai…
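For intuition, the aligned-substitution step described in the abstract can be sketched as token-level replacement driven by a bilingual dictionary. The sketch below is illustrative only, not the paper's implementation; the dictionary contents, the replacement probability, and the function name are assumptions.

```python
import random

def random_aligned_substitution(tokens, bilingual_dict, replace_prob=0.3, rng=None):
    """Minimal sketch of random aligned substitution (RAS).

    Tokens with an entry in a bilingual dictionary are swapped, with some
    probability, for a translation in another language, so that words with
    similar meanings across languages appear in shared training contexts.
    `bilingual_dict` and `replace_prob` are assumed inputs, not values
    taken from the paper.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        candidates = bilingual_dict.get(tok)
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))  # substitute an aligned translation
        else:
            out.append(tok)                     # keep the original token
    return out

# Toy English-to-French dictionary and a sample source sentence.
toy_dict = {"cat": ["chat"], "sat": ["s'assit"], "mat": ["tapis"]}
print(random_aligned_substitution("the cat sat on the mat".split(),
                                  toy_dict, replace_prob=0.5,
                                  rng=random.Random(0)))
```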

Cited by 38 publications (7 citation statements)
References 16 publications
“…Some explore data symmetry (Freitag and Firat, 2020; Birch et al, 2008; Lin et al, 2019). Zero-shot translation in severely low-resource settings exploits massive multilinguality, cross-lingual transfer, pretraining, iterative backtranslation, and frozen subnetworks (Lauscher et al, 2020; Nooralahzadeh et al, 2020; Pfeiffer et al, 2020; Baziotis et al, 2020; Chronopoulou et al, 2020; Lin et al, 2020; Thompson et al, 2018; Luong et al, 2014; Dou et al, 2020).…”
Section: Machine Polyglotism and Pretraining (mentioning)
confidence: 99%
“…We find that using many languages that are distant from the target low-resource language may produce only marginal improvements, if not a negative impact. Indeed, existing literature on zero-shot translation also suffers from the limitation of linguistic distance between the source languages and the target language (Lauscher et al, 2020; Lin et al, 2020; Pfeiffer et al, 2020). We therefore rank and select the top few source languages that are closest to the target low-resource language using the two metrics below.…”
Section: Ranking Source Languages (mentioning)
confidence: 99%
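The selection step quoted above amounts to a top-k ranking of candidate source languages by closeness to the target. The sketch below is hypothetical: the citing paper's two metrics are not reproduced on this page, so a placeholder `distance_fn` stands in for them, and all names and values are assumptions.

```python
def rank_source_languages(candidate_langs, target_lang, distance_fn, top_k=3):
    """Hypothetical sketch of the ranking step described above.

    `distance_fn(src, tgt)` is a placeholder for the citing paper's
    metrics (not given here); it should return smaller values for
    languages closer to the target.
    """
    ranked = sorted(candidate_langs, key=lambda lang: distance_fn(lang, target_lang))
    return ranked[:top_k]

# Example with a toy distance table standing in for a real metric.
toy_distance = {("es", "pt"): 0.2, ("fr", "pt"): 0.4, ("zh", "pt"): 0.9}
print(rank_source_languages(["es", "fr", "zh"], "pt",
                            lambda s, t: toy_distance[(s, t)], top_k=2))
```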
“…Some use data selection for active learning (Eck et al, 2005). Some use as few as ∼4,000 lines of data (Lin et al, 2020; Qi et al, 2018) and ∼1,000 lines (Zhou and Waibel, 2021). Some do not use low-resource data (Neubig and Hu, 2018; Karakanta et al, 2018).…”
Section: Severely Low Resource Text-based Translation (mentioning)
confidence: 99%
“…Given a closed text that already has many translations in different languages, we are interested in translating it well into a severely low-resource language. Researchers have recently demonstrated translation using very small seed parallel corpora in low-resource languages (Lin et al, 2020; Qi et al, 2018; Zhou et al, 2018a). Construction methods for such seed corpora are therefore pivotal to translation performance.…”
Section: Introduction (mentioning)
confidence: 99%