Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.351

Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri

Abstract: This paper presents a neural machine translation model and dataset for the Chibchan language Bribri, with an average performance of BLEU 16.9±1.7. This was trained on an extremely small dataset (5923 Bribri-Spanish pairs), providing evidence for the applicability of NMT in extremely low-resource environments. We discuss the challenges entailed in managing training input from languages without standard orthographies, we provide evidence of successful learning of Bribri grammar, and also examine the translations…

Cited by 19 publications (17 citation statements)
References 22 publications (15 reference statements)
“…The training set for Bribri was extracted from six sources (Feldman and Coto-Solano, 2020; Margery, 2005; Jara Murillo, 2018a; Constenla et al., 2004; Jara Murillo and García Segura, 2013; Jara Murillo, 2018b; Flores Solórzano, 2017), including a dictionary, a grammar, two language learning textbooks, one storybook and the transcribed sentences from one spoken corpus. The sentences belong to three major dialects: Amubri, Coroma and Salitre.…”
Section: Training Data
confidence: 99%
“…There are numerous sources of variation in the Bribri data (Feldman and Coto-Solano, 2020): 1) There are several different orthographies, which use different diacritics for the same words. 2) The Unicode encoding of visually similar diacritics differs among authors.…”
Section: Training Data
confidence: 99%
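The encoding problem described in the statement above — visually identical diacritics encoded differently by different authors — can be illustrated with a minimal Python sketch. The character used here is a generic accented vowel, not drawn from the Bribri dataset: Unicode permits the same glyph either as a single precomposed code point or as a base letter plus a combining mark, and normalizing both (e.g. to NFC) makes them comparable before training.

```python
import unicodedata

# Two encodings of the visually identical character "ó":
precomposed = "\u00f3"   # single precomposed code point (U+00F3)
combining = "o\u0301"    # base "o" + combining acute accent (U+0301)

# The raw strings differ at the code-point level...
assert precomposed != combining

# ...but after NFC normalization they are identical,
# so text from different sources can be matched consistently.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", combining)
```

Applying one normalization form to all sources is a common preprocessing step when corpora mix orthographic conventions, as with the multi-source Bribri data.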
“…Bribri The Bribri-Spanish data (Feldman and Coto-Solano, 2020) came from six different sources (a dictionary, a grammar, two language learning textbooks, one storybook, and transcribed sentences from a spoken corpus) and three major dialects (Amubri, Coroma, and Salitre). Two different orthographies are widely used for Bribri, so an intermediate representation was used to facilitate training.…”
Section: Parallel Data
confidence: 99%
“…There has been some work on Bribri NLP, including the creation of digital dictionaries (Krohn, 2020) and morphological analyzers used for documentation (Flores Solórzano, 2019, 2017b). There have also been some experiments with untrained forced alignment (Coto-Solano and Flores Solórzano, 2016, 2017), and with neural machine translation (Feldman and Coto-Solano, 2020). However, there is a need to accelerate the documentation of Bribri and produce more written materials out of existing recordings, and here we face the bottleneck problem mentioned above.…”
Section: Chibchan Languages and Bribri
confidence: 99%