Findings of the Association for Computational Linguistics: EMNLP 2020 2020
DOI: 10.18653/v1/2020.findings-emnlp.195
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate t…

(* ∀ represents the whole Masakhane community.)

Cited by 47 publications (46 citation statements). References 40 publications.
“…Multilingual models show better performance in languages that are similar to the highest-resource languages in their training data, and it has been shown that languages in multilingual models compete for model parameters, making it unclear how much variation can fit in a single model [Wang et al 2020d]. A salient issue stems from the data that we use to train multilingual foundation models: in many multilingual corpora, English data is not only orders of magnitude more abundant than that of lower-resource languages, but it is often cleaner, broader, and contains examples showcasing more linguistic depth and complexity [Caswell et al 2021] (see Nekoto et al [2020] on building participatory and robust multilingual datasets). However, the answer does not simply lie in creating more balanced corpora: there are so many axes of language variation that it would be infeasible to create a corpus that is balanced and representative in all regards.…”
Section: Language Variation and Multilingualitymentioning
confidence: 99%
“…It consists of both dataset building and the development of standardized code, and it also focuses on training a new generation of enthusiasts to carry the work forward. One of the prominent examples is the Masakhane project [120], which aims to put African AI, specifically African-language MT, on the world map. Within about two years, the Masakhane community has covered more than 38 African languages and produced multiple publications [120].…”
Section: Resultsmentioning
confidence: 99%
“…This creates a potentially vicious cycle of influence, which is important to break, or at least account for, while designing LT4SG. For instance, Nekoto et al (2020) identified that many stakeholders in the process of low-resource Machine Translation were missing invaluable language and societal knowledge, or the necessary technical resources, knowledge, connections, and incentives to form interactions with other stakeholders in the process. Due to limited knowledge and experience of technology, individuals may not consider the full range of costs and benefits while choosing an optimal language technology but rather choose an option that fulfils their adequacy criteria (Campitelli and Gobet 2010).…”
Section: Laying the Foundationmentioning
confidence: 99%
“…They have been extremely successful in doing so and have connected agents involved in language technology across the world. Nekoto et al (2020) used participatory research to identify and involve all necessary agents required in the Machine Translation development process. They identified missing interactions between content creators and data curators, leading to noisy translation pairs, and between stakeholders and evaluators, leading to unsuitable evaluation metrics.…”
Section: Case Studiesmentioning
confidence: 99%