Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.351

Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri

Abstract: This paper presents a neural machine translation model and dataset for the Chibchan language Bribri, with an average performance of BLEU 16.9±1.7. This was trained on an extremely small dataset (5923 Bribri-Spanish pairs), providing evidence for the applicability of NMT in extremely low-resource environments. We discuss the challenges entailed in managing training input from languages without standard orthographies, we provide evidence of successful learning of Bribri grammar, and also examine the translations…

Cited by 19 publications (17 citation statements)
References 22 publications (15 reference statements)
“…The training set for Bribri was extracted from six sources (Feldman and Coto-Solano, 2020; Margery, 2005; Jara Murillo, 2018a; Constenla et al., 2004; Jara Murillo and García Segura, 2013; Jara Murillo, 2018b; Flores Solórzano, 2017), including a dictionary, a grammar, two language learning textbooks, one storybook and the transcribed sentences from one spoken corpus. The sentences belong to three major dialects: Amubri, Coroma and Salitre.…”
Section: Training Data
confidence: 99%
“…There are numerous sources of variation in the Bribri data (Feldman and Coto-Solano, 2020): 1) There are several different orthographies, which use different diacritics for the same words. 2) The Unicode encoding of visually similar diacritics differs among authors.…”
Section: Training Data
confidence: 99%
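The encoding problem described in the statement above — visually identical diacritics encoded differently by different authors — can be illustrated with a minimal Python sketch. The character used here is a generic accented vowel, not drawn from the Bribri dataset: Unicode permits the same glyph either as a single precomposed code point or as a base letter plus a combining mark, and normalizing both (e.g. to NFC) makes them comparable before training.

```python
import unicodedata

# Two encodings of the visually identical character "ó":
precomposed = "\u00f3"   # single precomposed code point (U+00F3)
combining = "o\u0301"    # base "o" + combining acute accent (U+0301)

# The raw strings differ at the code-point level...
assert precomposed != combining

# ...but after NFC normalization they are identical,
# so text from different sources can be matched consistently.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", combining)
```

Applying one normalization form to all sources is a common preprocessing step when corpora mix orthographic conventions, as with the multi-source Bribri data.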
“…Bribri The Bribri-Spanish data (Feldman and Coto-Solano, 2020) came from six different sources (a dictionary, a grammar, two language learning textbooks, one storybook, and transcribed sentences from a spoken corpus) and three major dialects (Amubri, Coroma, and Salitre). Two different orthographies are widely used for Bribri, so an intermediate representation was used to facilitate training.…”
Section: Parallel Data
confidence: 99%
“…There has been some work on Bribri NLP, including the creation of digital dictionaries (Krohn, 2020) and morphological analyzers used for documentation (Flores Solórzano, 2019, 2017b). There have also been some experiments with untrained forced alignment (Coto-Solano and Flores Solórzano, 2016, 2017), and with neural machine translation (Feldman and Coto-Solano, 2020). However, there is a need to accelerate the documentation of Bribri and produce more written materials out of existing recordings, and here we face the bottleneck problem mentioned above.…”
Section: Chibchan Languages and Bribri
confidence: 99%