2022
DOI: 10.48550/arxiv.2205.03983
Preprint

Building Machine Translation Systems for the Next Thousand Languages

Abstract: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised…

Cited by 10 publications (9 citation statements) | References 57 publications (80 reference statements)
“…Web-NTL: For pre-training with unlabeled text, we use a web-crawled corpus of monolingual text containing over 28B sentences [76]. The dataset spans 1140 languages, 205 of which have over 1M sentences and 199 of which have between 100k and 1M sentences.…”
Section: Text Data
confidence: 99%
“…We up-sample lower resource languages using temperature-based sampling [77] with T = 3.0. More details about the dataset and the mining approach have been described in Section 2 of [76].…”
Section: Text Data
confidence: 99%
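The statement above refers to temperature-based sampling with T = 3.0, a standard way to up-sample low-resource languages when mixing multilingual training data. As a minimal sketch (the function name and the sentence counts below are ours for illustration, not from the cited work): each language's raw data share is raised to the power 1/T and renormalized, which flattens the distribution toward uniform as T grows.

```python
def temperature_sampling_probs(sentence_counts, T=3.0):
    """Return per-language sampling probabilities under temperature T.

    Each language's share of the corpus is raised to the power 1/T,
    then renormalized; T > 1 up-samples low-resource languages.
    """
    total = sum(sentence_counts.values())
    shares = {lang: n / total for lang, n in sentence_counts.items()}
    weights = {lang: s ** (1.0 / T) for lang, s in shares.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical corpus sizes for three languages:
counts = {"en": 1_000_000, "fr": 100_000, "liv": 1_000}
probs = temperature_sampling_probs(counts, T=3.0)
```

With T = 3.0 the smallest language's sampling probability rises well above its raw share of the corpus, while the largest language's falls, as intended.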
“…Machine translation has rapidly advanced in the past few years, expanding its scope to most languages. Meta's No Language Left Behind translates 200 different languages with high-quality results (Meta, 2022), and Google Translate, as of 2022, supports 133 languages, including 24 low-resource languages (Bapna, 2022). OpenAI's GPT models also emerge as excellent translators by generating context-relevant translations (Hendy et al., 2023).…”
Section: Automated Scoring
confidence: 99%
“…A common approach is to expand the vocabulary and the word embedding matrix to contain the extra tokens. However, the incoming embeddings must be randomly initialized (Garcia et al., 2021; Bapna et al., 2022), which leads to inconsistency with the original embeddings and increases training difficulty. Fortunately, Rikters et al. (2022) have released a translation model for En-Liv called Liv4ever-MT 4 .…”
Section: Introduction
confidence: 99%
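The vocabulary-expansion approach described above can be sketched as follows. This is an illustrative implementation under our own assumptions (embeddings as plain lists of floats, Gaussian initialization with a small standard deviation), not the procedure from any of the cited papers: new rows are appended to the embedding table and drawn at random, which is exactly the source of the inconsistency with the pretrained rows that the statement notes.

```python
import random

def expand_embeddings(emb, num_new_tokens, std=0.02, seed=0):
    """Append randomly initialized embedding rows for new vocabulary tokens.

    `emb` is a list of rows (one per existing token). New rows are drawn
    from a zero-mean Gaussian, so they are unrelated to the pretrained
    rows -- the mismatch the cited statement describes.
    """
    rng = random.Random(seed)
    dim = len(emb[0])
    new_rows = [[rng.gauss(0.0, std) for _ in range(dim)]
                for _ in range(num_new_tokens)]
    return emb + new_rows

# Toy 2-token, 4-dimensional embedding table, extended by 3 new tokens:
old = [[0.1] * 4, [0.2] * 4]
new = expand_embeddings(old, num_new_tokens=3)
```

The original rows are preserved unchanged; only the appended rows are random, which is why such extensions typically need further fine-tuning.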