Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.363

From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers

Abstract: Massively multilingual transformers (MMTs) pretrained via language modeling (e.g., mBERT, XLM-R) have become a default paradigm for zero-shot language transfer in NLP, offering unmatched transfer performance. Current evaluations, however, verify their efficacy in transfers (a) to languages with sufficiently large pretraining corpora, and (b) between close languages. In this work, we analyze the limitations of downstream language transfer with MMTs, showing that, much like cross-lingual word embeddings, they are substantially less effective in resource-lean scenarios and for distant languages.
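
As a concrete illustration of the zero-shot transfer setup the paper evaluates, the sketch below fine-tunes a multilingual transformer on English task data only and then evaluates it directly on a distant target language. This is a minimal sketch, assuming the Hugging Face transformers and datasets libraries and the public XNLI dataset; the choice of XLM-R, Swahili as the target language, and all hyperparameters are illustrative assumptions, not the paper's exact experimental configuration.

```python
# Zero-shot cross-lingual transfer sketch: fine-tune on English, evaluate on Swahili.
# Assumes the Hugging Face `transformers` and `datasets` libraries and the XNLI dataset.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    # Tokenize premise/hypothesis pairs; fixed-length padding keeps batching simple.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

# Fine-tune on the English portion only ...
train_en = load_dataset("xnli", "en", split="train").map(encode, batched=True)
# ... then evaluate directly on a lower-resource target language (Swahili here),
# with no target-language labels seen during training.
test_sw = load_dataset("xnli", "sw", split="test").map(encode, batched=True)

args = TrainingArguments(output_dir="xlmr-xnli-en",
                         per_device_train_batch_size=32,
                         num_train_epochs=2)
trainer = Trainer(model=model, args=args, train_dataset=train_en)
trainer.train()
print(trainer.evaluate(eval_dataset=test_sw))  # zero-shot target-language loss
```

The gap between source-language and target-language performance measured this way is, roughly, the quantity the paper relates to pretraining corpus size and linguistic proximity.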

Cited by 198 publications (273 citation statements)
References 51 publications (67 reference statements)
“…This effectively means that M-BERT's subword vocabulary contains plenty of CMN-specific and YUE-specific subwords that are exploited by the encoder when producing M-BERT-based representations. Simultaneously, higher scores with M-BERT (and XLM in Table 13) are reported for resource-rich languages such as French, Spanish, and English, which are better represented in M-BERT's training data, while we observe large performance losses for lower-resource languages: these artifacts of massively multilingual training with M-BERT and XLM, and the lower performance in low-resource languages, were further validated recently (Lauscher et al. 2020; Wu and Dredze 2020). We also observe lower absolute scores (and a larger number of OOVs) for languages with very rich and productive morphological systems such as the two Slavic languages (Polish and Russian) and Finnish.…”
Section: Table 13 (supporting)
confidence: 78%
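The fragmentation and OOV effects described in the statement above can be probed directly from a tokenizer. The following minimal sketch, assuming the Hugging Face transformers library and a few illustrative sample sentences (not the evaluation data of the cited work), reports tokens-per-word ratios and [UNK] counts for mBERT across languages; heavier fragmentation and more [UNK] pieces indicate poorer subword-vocabulary coverage.

```python
# Probe mBERT's shared subword vocabulary coverage across languages.
# Sample sentences below are illustrative placeholders, not evaluation data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

samples = {
    "French":  "Le modèle multilingue partage un vocabulaire de sous-mots.",
    "Finnish": "Monikielinen malli jakaa osasanaston kaikkien kielten kesken.",
    "Yue":     "呢個多語言模型共用一個子詞詞表。",
}

for lang, sent in samples.items():
    tokens = tokenizer.tokenize(sent)
    words = sent.split()
    # More subword pieces per whitespace word and more [UNK] tokens suggest
    # weaker vocabulary coverage for that language.
    unk = tokens.count(tokenizer.unk_token)
    print(f"{lang:8s} tokens/word={len(tokens) / max(len(words), 1):.2f}  [UNK]={unk}")
```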
“…Wu and Dredze (2020) consider the performance on up to 99 languages for NER. In contrast, Lauscher et al (2020) show limitations of the zero-shot setting and Zhao et al (2020) observe poor performance of mBERT in reference-free machine translation evaluation. Prior work here focuses on investigating the degree of multilinguality, not the reasons for it.…”
Section: Related Work (mentioning)
confidence: 92%
“…For pretraining approaches where labeled data exists in a high-resource language and the information is transferred to a low-resource language, Hu et al. (2020) find a significant gap between performance on English and the cross-lingually transferred models. In a recent study, Lauscher et al. (2020) find that transfer with multilingual transformer models is less effective for resource-lean settings and distant languages. A popular technique to obtain labeled data quickly and cheaply is distant and weak supervision.…”
Section: Introduction (mentioning)
confidence: 99%
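The last statement mentions distant and weak supervision as a cheap way to obtain labeled data. The toy sketch below shows the basic idea for NER via gazetteer lookup; the entity lists and tag scheme are invented for illustration and are not taken from the cited work.

```python
# Toy distant supervision for NER: label tokens by gazetteer membership.
# Gazetteer entries and tags are hypothetical; real pipelines use larger
# resources, matching heuristics, and multi-token span handling.
GAZETTEER = {"paris": "LOC", "kenya": "LOC", "google": "ORG"}

def weak_label(tokens):
    # Tag tokens found in the gazetteer; everything else gets the outside tag "O".
    return [GAZETTEER.get(tok.lower(), "O") for tok in tokens]

print(weak_label("Google opened an office in Kenya .".split()))
# ['ORG', 'O', 'O', 'O', 'O', 'LOC', 'O']
```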