Findings of the Association for Computational Linguistics: EMNLP 2020 2020
DOI: 10.18653/v1/2020.findings-emnlp.195
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate t…

(* ∀ represents the whole Masakhane community.)

Cited by 47 publications (46 citation statements). References 40 publications.
“…Multilingual models show better performance in languages that are similar to the highest-resource languages in their training data, and it has been shown that languages in multilingual models compete for model parameters, making it unclear how much variation can fit in a single model [Wang et al 2020d]. A salient issue stems from the data that we use to train multilingual foundation models: in many multilingual corpora, English data is not only orders of magnitude more abundant than that of lower-resource languages, but it is often cleaner, broader, and contains examples showcasing more linguistic depth and complexity [Caswell et al 2021] (see Nekoto et al [2020] on building participatory and robust multilingual datasets). However, the answer does not simply lie in creating more balanced corpora: there are so many axes of language variation that it would be infeasible to create a corpus that is balanced and representative in all regards.…”
Section: Language Variation and Multilingualitymentioning
confidence: 99%
“…It consists of both dataset building and the development of standardized code, and it also focuses on training a new generation of enthusiasts to carry the work forward. One of the prominent examples is the Masakhane project [120], which aims to put African AI, specifically African-language MT, on the world map. Within about two years, the Masakhane community has covered more than 38 African languages and produced multiple publications [120].…”
Section: Resultsmentioning
confidence: 99%
“…This creates a potentially vicious cycle of influence, which is important to break, or at least account for, while designing LT4SG. For instance, Nekoto et al (2020) identified that many stakeholders in the process of low-resource Machine Translation were missing invaluable language and societal knowledge, or the necessary technical resources, knowledge, connections, and incentives to form interactions with other stakeholders in the process. Due to limited knowledge and experience of technology, individuals may not consider the full range of costs and benefits while choosing an optimal language technology but rather choose an option that fulfils their adequacy criteria (Campitelli and Gobet 2010).…”
Section: Laying the Foundationmentioning
confidence: 99%
“…They have been extremely successful in doing so and have connected agents involved in language technology across the world. Nekoto et al (2020) used participatory research to identify and involve all necessary agents required in the Machine Translation development process. They identified missing interactions between content creators and data curators, leading to noisy translation pairs, and between stakeholders and evaluators, leading to unsuitable evaluation metrics.…”
Section: Case Studiesmentioning
confidence: 99%