Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.223

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to opti…

Cited by 10 publications (8 citation statements) · References 15 publications
“…While Kreutzer et al. (2022)'s audit of mC4 did not yield a significant amount of offensive content (0.06% of sentences they audited) and our web crawls mainly focused on verified news publications, these filters ensure that non-linguistic and offensive contents are removed at the passage level.…”
Section: Language Contamination
Citation type: mentioning (confidence: 99%)
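The statement above describes passage-level filters only at a high level. As a hedged illustration, the sketch below shows one way such a filter might look in Python; the alphabetic-ratio threshold and the blocklist are placeholders chosen for illustration, not the actual heuristics or term lists used in the cited work.

import re

# Placeholder blocklist; the real filtering lists are not specified in the quoted statement.
OFFENSIVE_TERMS = {"badword1", "badword2"}

def is_linguistic(passage: str, min_alpha_ratio: float = 0.6) -> bool:
    """Heuristic: treat a passage as linguistic if most characters are letters or spaces."""
    if not passage.strip():
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in passage)
    return alpha / len(passage) >= min_alpha_ratio

def is_offensive(passage: str) -> bool:
    """Heuristic: flag a passage if it contains any blocklisted term as a whole word."""
    tokens = set(re.findall(r"\w+", passage.lower()))
    return bool(tokens & OFFENSIVE_TERMS)

def filter_passages(passages: list[str]) -> list[str]:
    """Keep only passages that look linguistic and are not flagged as offensive."""
    return [p for p in passages if is_linguistic(p) and not is_offensive(p)]

if __name__ == "__main__":
    corpus = ["A normal news sentence about elections.", "$$$ 12345 @@@ !!!"]
    print(filter_passages(corpus))  # the non-linguistic passage is dropped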
“…Previous works have shown the correlation between the quality of the data used in pretraining a model and the performance of the trained model (Rae et al., 2021; Kreutzer et al., 2022; Hernandez et al., 2022). AfriTeVa V2's improvement over baselines in downstream tasks suggests that this is true.…”
Section: Impact of Data Quality on LMs
Citation type: mentioning (confidence: 99%)
“…If the quality of the translator is bad, the model can even underperform the FT baseline without any cross-lingual transfer. Considering that most low-resource languages do not have high-quality MT systems yet (Adelani et al., 2022), this further implies we should rather focus on translation-free approaches for this task.…”
Section: Cross-domain
Citation type: mentioning (confidence: 99%)