Cost-Effective Training in Low-Resource Neural Machine Translation
Preprint, 2022. DOI: 10.48550/arxiv.2201.05700

Abstract: While Active Learning (AL) techniques are explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets, where only a limited number of sentences can be translated. Such situations are especially challenging and can occur for endangered languages with few human annotators, or where cost constraints prevent labeling large amounts of data. Although AL is shown to be helpful with large budgets, it is not enough to build high-quality translation systems in these low-resource conditions…
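The abstract frames AL as choosing which sentences to have translated under a fixed annotation budget. As a minimal, hypothetical sketch of that selection step (the function names and the length-based stand-in score below are illustrative assumptions, not the paper's method):

```python
import random

def select_for_annotation(pool, budget, score=None, seed=0):
    """Pick `budget` sentences from an unlabeled pool for human translation.

    With no scoring function this is plain random sampling; with one, it
    greedily takes the highest-scoring sentences (uncertainty, density, ...).
    """
    if score is None:
        return random.Random(seed).sample(pool, budget)
    return sorted(pool, key=score, reverse=True)[:budget]

# Toy usage: sentence length as an illustrative stand-in for a real score.
pool = ["a short one", "a somewhat longer sentence", "the longest sentence of them all"]
print(select_for_annotation(pool, budget=2, score=len))
```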

Cited by 2 publications (3 citation statements). References 17 publications (29 reference statements).

Citation statements:
“…Random sampling is often surprisingly powerful (Kendall and Smith, 1938; Knuth, 1991; Sennrich et al., 2016a). There is extensive research on beating random sampling with methods based on entropy (Koneru et al., 2022), coverage and uncertainty (Peris and Casacuberta, 2018; Zhao et al., 2020), clustering (Gangadharaiah et al., 2009), consensus, syntactic parsing (Miura et al., 2016), density and diversity (Koneru et al., 2022; Ambati et al., 2011), and learning to learn active learning strategies (Liu et al., 2018)…”
Section: Active Learning in Machine Translation (citation type: mentioning, confidence: 99%)
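The quoted passage contrasts random sampling with acquisition scores such as entropy. As one hedged illustration of the idea, not the specific method of any cited work, the sketch below greedily picks the sentence whose tokens are rarest under the unigram statistics of what has been selected so far, so redundant sentences lose value after each pick:

```python
import math
from collections import Counter

def token_surprisal(sentence, counts, total):
    """Mean add-one-smoothed surprisal of a sentence's tokens under the
    unigram statistics of the sentences selected so far; rarer tokens
    make a sentence look more informative. Assumes non-empty sentences."""
    tokens = sentence.split()
    return sum(-math.log((counts[t] + 1) / (total + 1)) for t in tokens) / len(tokens)

def entropy_select(pool, budget):
    """Greedily pick the highest-surprisal sentence, updating the
    statistics after every pick."""
    counts, total, chosen = Counter(), 0, []
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=lambda s: token_surprisal(s, counts, total))
        pool.remove(best)
        chosen.append(best)
        counts.update(best.split())
        total += len(best.split())
    return chosen

# After "a b" is taken, its duplicate scores low and "c d" wins the second slot.
print(entropy_select(["a b", "a b", "c d"], budget=2))
```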
“…Let I_l(·) and I_r(·) be indicator functions that show whether a sentence belongs to the left or the right. We aim to maximize the diversity H_c and optimize density by adjusting H_l and H_r (Koneru et al., 2022)…”
(Citation type: mentioning, confidence: 99%)
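The snippet is too terse to recover the exact definitions, so the following is only a speculative sketch: it assumes H_l, H_r, and H_c are Shannon entropies of n-gram distributions over the left split, the right split, and their union, one common way such diversity and density terms are computed; it is not necessarily the construction of Koneru et al. (2022).

```python
import math
from collections import Counter

def ngram_entropy(sentences, n=1):
    """Shannon entropy (bits) of the n-gram distribution over a sentence set."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical left/right split of candidate sentences.
left = ["the cat sat", "the dog ran"]
right = ["rivers erode valleys", "quantum fields interact"]
H_l, H_r = ngram_entropy(left), ngram_entropy(right)
H_c = ngram_entropy(left + right)  # diversity of the combined selection
print(H_l, H_r, H_c)
```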
“…Another work that utilizes back-translation for effective NMT training is that of Dou et al. (2020). Koneru et al. (2022) propose a cost-effective training procedure that improves NMT performance using a small number of annotated sentences and dictionary entries. Park et al. (2020) looked into decoding strategies for low-resource languages in an attempt to improve training…”
Section: Introduction (citation type: mentioning, confidence: 99%)
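Back-translation, mentioned above in connection with Dou et al. (2020), augments training data by translating monolingual target-language text back into the source language. A generic, hypothetical sketch follows; the tgt2src_model callable is a stand-in, not any cited system's API:

```python
def back_translate(mono_target, tgt2src_model):
    """Generic back-translation: run monolingual target-language sentences
    through a reverse (target-to-source) model and pair each synthetic
    source with its authentic target to get extra parallel data."""
    return [(tgt2src_model(t), t) for t in mono_target]

# Toy usage with a stand-in "model" that just reverses word order.
fake_reverse = lambda s: " ".join(reversed(s.split()))
print(back_translate(["das ist ein test"], fake_reverse))
```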