Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.435

AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages

Abstract: Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen languages has largely been limited to low-level, syntactic tasks, and it remains unclear if zero-shot learning of high-level, semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, an extension of XNLI (Conneau et al., 2018) to 10 Indigenous languages of the Americas. We co…

Cited by 34 publications (29 citation statements). References 48 publications (39 reference statements).
“…For POS and DP, we sample ten low-resource languages from the Universal Dependencies (UD) 2.7 dataset (Zeman et al, 2020), taking into account: 1) the availability and the size of the corresponding Wikipedia; and 2) typological diversity to ensure that different language families are covered. For NLI, we rely on the recent AmericasNLI dataset (Ebrahimi et al, 2022), spanning ten low-resource languages from the Americas. For AmericasNLI languages, we use Wikipedia if available; otherwise we use the unlabelled data previously used by Ansell et al (2022).…”
Section: Experiments and Results
confidence: 99%
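
The excerpt above evaluates NLI on the AmericasNLI languages. As a minimal sketch of how that evaluation data can be loaded with the Hugging Face `datasets` library: the Hub dataset ID `americas_nli` and the language code `quy` (Quechua) are assumptions about the published naming, not something stated in the excerpt.

```python
# Hedged sketch: loading one AmericasNLI language (Ebrahimi et al., 2022).
# The dataset ID "americas_nli" and config "quy" are assumed Hub names.
from datasets import load_dataset

# AmericasNLI is evaluation-only (zero-shot), so there is no training split;
# we inspect whatever validation/test splits the release provides.
anli_quy = load_dataset("americas_nli", "quy")
print(anli_quy)              # available splits and their sizes
print(anli_quy["test"][0])   # a single premise/hypothesis/label example
```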
“…The adapter reduction factor (Pfeiffer et al, 2020a) is 2 for LAs and BAs and 16 for TAs. For AmericasNLI, we train its TA using the English MultiNLI data (Williams et al, 2018) following the setup of Ebrahimi et al (2022): 5 epochs with a batch size of 32, and a learning rate of 2e−5. We evaluate the TA every 625 steps and choose the one with the best English validation accuracy.…”
Section: Methods
confidence: 99%
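
The excerpt above fully specifies the English MultiNLI training schedule (5 epochs, batch size 32, learning rate 2e-5, evaluation every 625 steps, best checkpoint by English validation accuracy). Below is a minimal sketch of that schedule using the plain Hugging Face `Trainer`; the backbone name is an assumption, and the adapter-specific wiring (a task adapter with reduction factor 16 stacked on language adapters, per Pfeiffer et al., 2020a) is omitted, so this is not the authors' code, only the quoted hyperparameters.

```python
# Hedged sketch of the MultiNLI training schedule quoted above.
# Assumption: XLM-R base with full fine-tuning instead of the task adapter
# (reduction factor 16) that the cited work actually trains.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"  # assumption; not stated in the excerpt
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

mnli = load_dataset("multi_nli")  # English MultiNLI (Williams et al., 2018)

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

mnli = mnli.map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="mnli-task-model",
    num_train_epochs=5,               # 5 epochs
    per_device_train_batch_size=32,   # batch size 32
    learning_rate=2e-5,               # learning rate 2e-5
    evaluation_strategy="steps",
    eval_steps=625,                   # evaluate every 625 steps
    save_strategy="steps",
    save_steps=625,
    load_best_model_at_end=True,      # keep best English validation accuracy
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=mnli["train"],
    eval_dataset=mnli["validation_matched"],  # English validation set
    compute_metrics=accuracy,
    tokenizer=tokenizer,                      # enables padded batching
)
trainer.train()
```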
“…Multilingual benchmarks or datasets are created in a variety of ways. Several benchmarks are created by translating monolingual benchmarks into different languages, usually through a professional translation service (Artetxe et al, 2020; Conneau et al, 2018; Ebrahimi et al, 2022; Lewis et al, 2020; Li et al, 2021a; FitzGerald et al, 2022; Longpre et al, 2021; Mostafazadeh et al, 2016; Zhang et al, 2019; Lin et al, 2021b; Ponti et al, 2020). Other multilingual benchmarks, instead, have been built by separately annotating each language via its native speakers (e.g.…”
Section: Generalisation Across Languages
confidence: 99%
“…Conneau et al (2018) present XNLI, a multilingual dataset created by translating English NLI examples into other languages. The interest in multilingual NLI has resulted in the creation of some novel non-English resources such as the Korean NLI corpus (Ham et al, 2020), Chinese NLI corpus (Hu et al, 2020), Persian NLI corpus (Amirkhani et al, 2020), Indonesian NLI corpus (Mahendra et al, 2021), and indigenous languages of the Americas NLI corpus (Ebrahimi et al, 2022). For Spanish, the only available resources are the Spanish portion of XNLI and the SPARTE corpus for RTE (Peñas et al, 2006) which was adapted from Question Answering data.…”
Section: Related Work
confidence: 99%