2022
DOI: 10.48550/arxiv.2205.03983
Preprint

Building Machine Translation Systems for the Next Thousand Languages

Abstract: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised…

Cited by 10 publications (9 citation statements) | References 57 publications (80 reference statements)
“…Web-NTL: For pre-training with unlabeled text, we use a web-crawled corpus of monolingual text containing over 28B sentences [76]. The dataset spans 1140 languages, 205 of which have over 1M sentences and 199 of which have between 100k and 1M sentences.…”
Section: Text Data
confidence: 99%
“…We up-sample lower resource languages using temperature-based sampling [77] with T = 3.0. More details about the dataset and the mining approach have been described in Section 2 of [76].…”
Section: Text Data
confidence: 99%
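The statement above refers to temperature-based sampling with T = 3.0, a standard way to up-sample low-resource languages when mixing multilingual training data. As a minimal sketch (the function name and the sentence counts below are ours for illustration, not from the cited work): each language's raw data share is raised to the power 1/T and renormalized, which flattens the distribution toward uniform as T grows.

```python
def temperature_sampling_probs(sentence_counts, T=3.0):
    """Return per-language sampling probabilities under temperature T.

    Each language's share of the corpus is raised to the power 1/T,
    then renormalized; T > 1 up-samples low-resource languages.
    """
    total = sum(sentence_counts.values())
    shares = {lang: n / total for lang, n in sentence_counts.items()}
    weights = {lang: s ** (1.0 / T) for lang, s in shares.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical corpus sizes for three languages:
counts = {"en": 1_000_000, "fr": 100_000, "liv": 1_000}
probs = temperature_sampling_probs(counts, T=3.0)
```

With T = 3.0 the smallest language's sampling probability rises well above its raw share of the corpus, while the largest language's falls, as intended.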
“…Machine translation has rapidly advanced in the past few years, expanding its scope to most languages. Meta's No Language Left Behind translates 200 different languages with high-quality results (Meta, 2022), and Google Translate, as of 2022, supports 133 languages, including 24 low-resource languages (Bapna, 2022). OpenAI's GPT models also emerge as excellent translators by generating context-relevant translations (Hendy et al., 2023).…”
Section: Automated Scoring
confidence: 99%
“…A common approach is to expand the vocabulary and the word embedding matrix to contain the extra tokens. However, the incoming embeddings must be randomly initialized (Garcia et al., 2021; Bapna et al., 2022), which leads to inconsistency with the original embeddings and increases training difficulty. Fortunately, Rikters et al. (2022) have released a translation model for En-Liv called Liv4ever-MT 4 .…”
Section: Introduction
confidence: 99%
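The vocabulary-expansion approach described above can be sketched as follows. This is an illustrative implementation under our own assumptions (embeddings as plain lists of floats, Gaussian initialization with a small standard deviation), not the procedure from any of the cited papers: new rows are appended to the embedding table and drawn at random, which is exactly the source of the inconsistency with the pretrained rows that the statement notes.

```python
import random

def expand_embeddings(emb, num_new_tokens, std=0.02, seed=0):
    """Append randomly initialized embedding rows for new vocabulary tokens.

    `emb` is a list of rows (one per existing token). New rows are drawn
    from a zero-mean Gaussian, so they are unrelated to the pretrained
    rows -- the mismatch the cited statement describes.
    """
    rng = random.Random(seed)
    dim = len(emb[0])
    new_rows = [[rng.gauss(0.0, std) for _ in range(dim)]
                for _ in range(num_new_tokens)]
    return emb + new_rows

# Toy 2-token, 4-dimensional embedding table, extended by 3 new tokens:
old = [[0.1] * 4, [0.2] * 4]
new = expand_embeddings(old, num_new_tokens=3)
```

The original rows are preserved unchanged; only the appended rows are random, which is why such extensions typically need further fine-tuning.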