Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.358

Identifying Elements Essential for BERT’s Multilinguality

Abstract: It has been shown that multilingual BERT (mBERT) yields high quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure. We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experim…

Cited by 37 publications (43 citation statements); references 24 publications.

“…In our implementations, we instead use the hidden vectors of [CLS] at layer 8 to perform contrastive learning for base-size (12-layer) models, and layer 12 for large-size (24-layer) models, because previous analysis (Sabet et al., 2020; Dufter and Schütze, 2020; Conneau et al., 2020b) shows that these specific layers of MMLM learn more universal representations and work better on cross-lingual retrieval tasks than other layers. We choose the layers following the same principle.…”
Section: Cross-lingual Contrastive Learning
confidence: 99%
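To make the layer choice above concrete, here is a minimal sketch of extracting the layer-8 [CLS] vector from a base-size multilingual encoder and scoring translation pairs with an InfoNCE-style contrastive loss. The mBERT checkpoint, the helper names, and the loss formulation are illustrative assumptions, not the cited papers' actual code.

```python
# Minimal sketch (assumed mBERT checkpoint and InfoNCE-style loss; not the cited papers' code).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def cls_at_layer(sentences, layer=8):
    """Return the [CLS] hidden vectors taken from the given encoder layer."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch)
    # hidden_states[0] is the embedding output, so index `layer` is the
    # output of encoder layer `layer` (layer 8 of a 12-layer base model).
    return out.hidden_states[layer][:, 0]          # shape: (batch, hidden)

def info_nce(anchor, positive, temperature=0.05):
    """InfoNCE loss treating the i-th pair in the batch as the positive."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature               # (batch, batch) similarities
    labels = torch.arange(a.size(0))               # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Toy translation pairs; in a real setup this loss would update the encoder.
en = cls_at_layer(["The cat sleeps.", "I like tea."])
de = cls_at_layer(["Die Katze schläft.", "Ich mag Tee."])
loss = info_nce(en, de)
```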
“…Let (c2, x2) denote an MMLM instance in a different language than (c1, x1). Because the vocabulary, the position embeddings, and the special tokens are shared across languages, it is common to find anchor points (Pires et al., 2019; Dufter and Schütze, 2020) where x1 = x2 (such as shared subwords, punctuation, and digits) or where I(x1, x2) is positive (i.e., the representations are associated or isomorphic). Through the bridging effect of {x1, x2}, MMLM obtains a v-structure dependency "c1 → {x1, x2} ← c2", which leads to a negative co-information (i.e., interaction information) I(c1; c2; {x1, x2}) (Tsujishita, 1995).…”
Section: Multilingual Masked Language Modeling
confidence: 99%
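For intuition: under the common convention, the co-information above can be written as I(c1; c2; {x1, x2}) = I(c1; c2) − I(c1; c2 | {x1, x2}), so it turns negative when conditioning on the shared anchors induces a dependency between the two contexts. A hedged sketch of finding such anchor points (subword pieces that appear verbatim in sentences of two different languages) might look like the following; the tokenizer checkpoint and helper are illustrative assumptions, not the cited work's code.

```python
# Hypothetical sketch: finding anchor points (x1 == x2) between two sentences
# in different languages, i.e. subword pieces shared by both tokenizations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def anchor_points(sent_a, sent_b):
    """Return the subword pieces that occur in both tokenized sentences."""
    pieces_a = set(tokenizer.tokenize(sent_a))
    pieces_b = set(tokenizer.tokenize(sent_b))
    return pieces_a & pieces_b   # shared subwords, digits, punctuation

# Digits, punctuation, and shared word pieces typically survive in both languages.
shared = anchor_points("BERT was released in 2018.",
                       "BERT wurde 2018 veröffentlicht.")
```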
“…In recent years, several pre-trained multilingual language models have been proposed for zero-shot cross-lingual transfer, including multilingual BERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020a; Goyal et al., 2021). Many studies focus on the rationales that make zero-shot cross-lingual transfer work (K et al., 2020; Lauscher et al., 2020; Conneau et al., 2020b; Artetxe et al., 2020; Dufter and Schütze, 2020). Various tasks and datasets have been presented to facilitate zero-shot cross-lingual transfer learning (Conneau et al., 2018; Yang et al., 2019; Clark et al., 2020; Artetxe et al., 2020; Lewis et al., 2020).…”
Section: Related Work
confidence: 99%
“…In the meantime, the efficiency of cross-lingual transfer with recently released pretrained multilingual language models (Devlin et al., 2019; Conneau et al., 2020a) has boosted an active line of research that analyzes their representations to understand what favors the emergence of an interlingua. For instance, Pires et al. (2019), Dufter and Schütze (2020), and Karthikeyan et al. (2020) tried to decouple the effect of shared "anchors" from the rest of the model. Very recently, Muller et al. (2021) performed a more fine-grained analysis, examining representations at each layer of the model.…”
Section: Related Work
confidence: 99%