Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.223

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to opti…

Cited by 10 publications (8 citation statements) · References 15 publications
“…While Kreutzer et al. (2022)'s audit of mC4 did not yield a significant amount of offensive content (0.06% of sentences they audited) and our web crawls mainly focused on verified news publications, these filters ensure that non-linguistic and offensive contents are removed at the passage level.…”
Section: Language Contamination
Citation type: mentioning (confidence: 99%)
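The statement above describes passage-level filters only at a high level. As a hedged illustration, the sketch below shows one way such a filter might look in Python; the alphabetic-ratio threshold and the blocklist are placeholders chosen for illustration, not the actual heuristics or term lists used in the cited work.

import re

# Placeholder blocklist; the real filtering lists are not specified in the quoted statement.
OFFENSIVE_TERMS = {"badword1", "badword2"}

def is_linguistic(passage: str, min_alpha_ratio: float = 0.6) -> bool:
    """Heuristic: treat a passage as linguistic if most characters are letters or spaces."""
    if not passage.strip():
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in passage)
    return alpha / len(passage) >= min_alpha_ratio

def is_offensive(passage: str) -> bool:
    """Heuristic: flag a passage if it contains any blocklisted term as a whole word."""
    tokens = set(re.findall(r"\w+", passage.lower()))
    return bool(tokens & OFFENSIVE_TERMS)

def filter_passages(passages: list[str]) -> list[str]:
    """Keep only passages that look linguistic and are not flagged as offensive."""
    return [p for p in passages if is_linguistic(p) and not is_offensive(p)]

if __name__ == "__main__":
    corpus = ["A normal news sentence about elections.", "$$$ 12345 @@@ !!!"]
    print(filter_passages(corpus))  # the non-linguistic passage is dropped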
“…Previous works have shown the correlation between the quality of the data used in pretraining a model and the performance of the trained model (Rae et al., 2021; Kreutzer et al., 2022; Hernandez et al., 2022). AfriTeVa V2's improvement over baselines in downstream tasks suggests that this is true.…”
Section: Impact of Data Quality on LMs
Citation type: mentioning (confidence: 99%)
“…If the quality of the translator is bad, the model can even underperform the FT baseline without any cross-lingual transfer. Considering that most low-resource languages do not have high-quality MT systems yet (Adelani et al., 2022), this further implies we should rather focus on translation-free approaches for this task.…”
Section: Cross-domain
Citation type: mentioning (confidence: 99%)