2019
DOI: 10.48550/arxiv.1907.05019
Preprint

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

Abstract: We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. …

Cited by 140 publications (198 citation statements) · References 104 publications
“…DOCmT5-5 significantly outperforms Doc-NMT and DocTLM, showing that our proposed pretraining objective leads to improved cross-lingual learning. The results of DOCmT5-25 are inferior to DOCmT5-5, and this is possibly due to capacity dilution (Arivazhagan et al., 2019). As we increase the capacity, we see that DOCmT5-25-Large outperforms DOCmT5-5-Large.…”
Section: Results on Seen Language Pairs (mentioning)
confidence: 79%
“…In spite of the aforementioned near-human results on translation or understanding of languages from the world's economic and political superpowers, the experience of any NLP practitioner is that, for the vast majority of languages, they fall far below such standards. Critically, the languages of the world showcase substantial amounts of variation in most domains of description, and in fact, the performance of language technologies has been shown to be sensitive to diverse aspects of the language under study, including morphology, word order, or phonological repertoire, as well as more mundane aspects like data availability (Tsarfaty et al., 2020; Xia et al., 2020; Arivazhagan et al., 2019). Hence, the transfer of NLP developments from one language to another is far from trivial, as it often means that building highly functional language technologies on any particular language is a non-automatic, costly, and technically challenging task.…”
Section: Introduction (mentioning)
confidence: 99%
“…Multilingual learning has the potential of cross-lingual transfer, allowing low-resource languages to benefit from high-resource data when trained together (Conneau et al., 2019). However, in practice, this positive transfer is often mitigated by interference between languages (Arivazhagan et al., 2019; Tan et al., 2019; Zhang et al., 2020). This is because all languages, irrespective of the amount of data, are trained with a fixed model capacity, leading to insufficient specialized capacity.…”
Section: Introduction (mentioning)
confidence: 99%
“…We propose two straightforward techniques to improve BASELayers-based sparse architectures (Lewis et al., 2021) for multitask learning: first, we slowly ramp the number of instances from low-resource tasks over epochs rather than having a fixed sampling ratio (Arivazhagan et al., 2019). This promotes cross-lingual transfer and reduces over-fitting as the model witnesses low-resource task instances in the later epochs.…”
Section: Introduction (mentioning)
confidence: 99%
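
The statement above contrasts a fixed per-language sampling ratio with one that is gradually ramped over training so that low-resource examples are seen more often in later epochs. Below is a minimal sketch of both ideas, assuming temperature-based smoothing for the fixed ratio and a simple linear ramp for the schedule; the temperature value, the ramp shape, and the function names are illustrative assumptions rather than the exact procedures used in the cited papers.

```python
import numpy as np

def temperature_sampling_probs(example_counts, temperature=5.0):
    # Fixed per-language ratios: p_i proportional to n_i ** (1 / T).
    # T = 1 reproduces the raw data distribution; a larger T moves toward
    # uniform sampling over languages, up-weighting low-resource ones.
    counts = np.asarray(example_counts, dtype=np.float64)
    scaled = counts ** (1.0 / temperature)
    return scaled / scaled.sum()

def ramped_sampling_probs(example_counts, epoch, total_epochs, temperature=5.0):
    # Hypothetical ramping schedule: interpolate from the raw data
    # distribution toward the temperature-smoothed one as training
    # progresses, so low-resource tasks contribute more in later epochs.
    counts = np.asarray(example_counts, dtype=np.float64)
    proportional = counts / counts.sum()
    smoothed = temperature_sampling_probs(counts, temperature)
    alpha = min(1.0, epoch / max(1, total_epochs - 1))  # goes 0 -> 1 over training
    mixed = (1.0 - alpha) * proportional + alpha * smoothed
    return mixed / mixed.sum()

# Toy example: three languages with very different corpus sizes.
counts = [2_000_000_000, 5_000_000, 100_000]
print(temperature_sampling_probs(counts))          # fixed ratios for the whole run
for epoch in (0, 4, 9):
    print(epoch, ramped_sampling_probs(counts, epoch, total_epochs=10))
```

In the fixed-ratio setting the first printed distribution is reused for the entire run, whereas the ramped variant starts close to the proportional-to-data distribution and ends at the smoothed one.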