Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1493

How Multilingual is Multilingual BERT?

Abstract: In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2019) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer work…
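As a concrete illustration of the zero-shot transfer setup described in the abstract, the sketch below fine-tunes the public M-BERT checkpoint on a toy English classification task and then evaluates it on a Spanish sentence for which no labels were seen. It assumes the Hugging Face transformers and torch packages; the in-line sentences and labels are hypothetical placeholders, not the paper's actual NER or POS data.

```python
# Minimal sketch of zero-shot cross-lingual transfer with M-BERT.
# Assumes: pip install torch transformers. Toy data below is hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"  # the public M-BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Fine-tune on labelled data in the source language (English) only.
en_texts = ["the movie was great", "the movie was terrible"]  # hypothetical examples
en_labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
batch = tokenizer(en_texts, padding=True, return_tensors="pt")
loss = model(**batch, labels=en_labels).loss
loss.backward()
optimizer.step()

# Evaluate zero-shot on the target language (Spanish), no Spanish labels seen.
model.eval()
es_batch = tokenizer(["la película fue excelente"], return_tensors="pt")
with torch.no_grad():
    pred = model(**es_batch).logits.argmax(dim=-1)
print(pred)  # predicted class for the Spanish sentence
```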

Cited by 906 publications (871 citation statements)
References 13 publications (10 reference statements)
“…Similar to multilingual BERT, Mulcaire et al (2019) trains a single ELMo on distantly related languages and shows mixed results as to the benefit of pretraining. Parallel to our work, Pires et al (2019) shows mBERT has good zero-shot cross-lingual transfer performance on NER and POS tagging. They show how subword overlap and word ordering affect mBERT transfer performance.…”
Section: Introduction (supporting)
confidence: 72%
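The subword-overlap effect mentioned in the statement above can be probed with a few lines of code. The sketch below, assuming the Hugging Face transformers package, tokenizes a hypothetical English and German sentence pair with M-BERT's WordPiece tokenizer and reports a Jaccard-style overlap of the resulting subword sets; Pires et al. compute a related statistic over full task corpora rather than single sentences.

```python
# Rough sketch of a subword-overlap statistic between two languages under
# M-BERT's tokenizer. Assumes transformers is installed; sentences are hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def subword_set(sentences):
    """Return the set of WordPiece tokens M-BERT produces for a corpus."""
    pieces = set()
    for s in sentences:
        pieces.update(tokenizer.tokenize(s))
    return pieces

en = subword_set(["The president met the delegation in Berlin."])
de = subword_set(["Der Präsident traf die Delegation in Berlin."])

# Jaccard-style overlap: shared subwords / union of subwords.
overlap = len(en & de) / len(en | de)
print(f"subword overlap: {overlap:.2f}")
```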
“…While Pires et al (2019) hypothesize word order is the main culprit for the poor zero-shot performance for Japanese when transferring a POS tagger from English, our experiments with Korean and Japanese show a different picture.…”
Section: Language Outliers (mentioning)
confidence: 62%
“…Another interesting observation on transformer-based LMs is that multilingual models which were pre-trained from multiple monolingual corpora were able to generalize information across different languages [30]. Wu and Dredze [31] confirmed that a multilingual BERT model performed well uniformly across languages in document classification, named entity recognition, and part-of-speech tagging, when fine-tuned with a small amount of target language supervision for the downstream task.…”
Section: Related Work (mentioning)
confidence: 95%