Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.358

Identifying Elements Essential for BERT’s Multilinguality

Abstract: It has been shown that multilingual BERT (mBERT) yields high quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure. We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experim…

Cited by 37 publications (43 citation statements); references 24 publications.

“…In our implementations, we instead use the hidden vectors of [CLS] at layer 8 to perform contrastive learning for base-size (12-layer) models, and layer 12 for large-size (24-layer) models, because previous analysis (Sabet et al., 2020; Dufter and Schütze, 2020; Conneau et al., 2020b) shows that these specific layers of MMLM learn more universal representations and work better on cross-lingual retrieval tasks than other layers. We choose the layers following the same principle.…”
Section: Cross-lingual Contrastive Learning
confidence: 99%
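To make the layer choice above concrete, here is a minimal sketch of extracting the layer-8 [CLS] vector from a base-size multilingual encoder and scoring translation pairs with an InfoNCE-style contrastive loss. The mBERT checkpoint, the helper names, and the loss formulation are illustrative assumptions, not the cited papers' actual code.

```python
# Minimal sketch (assumed mBERT checkpoint and InfoNCE-style loss; not the cited papers' code).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def cls_at_layer(sentences, layer=8):
    """Return the [CLS] hidden vectors taken from the given encoder layer."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch)
    # hidden_states[0] is the embedding output, so index `layer` is the
    # output of encoder layer `layer` (layer 8 of a 12-layer base model).
    return out.hidden_states[layer][:, 0]          # shape: (batch, hidden)

def info_nce(anchor, positive, temperature=0.05):
    """InfoNCE loss treating the i-th pair in the batch as the positive."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature               # (batch, batch) similarities
    labels = torch.arange(a.size(0))               # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Toy translation pairs; in a real setup this loss would update the encoder.
en = cls_at_layer(["The cat sleeps.", "I like tea."])
de = cls_at_layer(["Die Katze schläft.", "Ich mag Tee."])
loss = info_nce(en, de)
```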
“…Let (c2, x2) denote an MMLM instance in a different language than (c1, x1). Because the vocabulary, the position embeddings, and the special tokens are shared across languages, it is common to find anchor points (Pires et al., 2019; Dufter and Schütze, 2020) where x1 = x2 (such as shared subwords, punctuation, and digits) or where I(x1, x2) is positive (i.e., the representations are associated or isomorphic). Through the bridging effect of {x1, x2}, MMLM obtains a v-structure dependency "c1 → {x1, x2} ← c2", which leads to a negative co-information (i.e., interaction information) I(c1; c2; {x1, x2}) (Tsujishita, 1995).…”
Section: Multilingual Masked Language Modeling
confidence: 99%
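For intuition: under the common convention, the co-information above can be written as I(c1; c2; {x1, x2}) = I(c1; c2) − I(c1; c2 | {x1, x2}), so it turns negative when conditioning on the shared anchors induces a dependency between the two contexts. A hedged sketch of finding such anchor points (subword pieces that appear verbatim in sentences of two different languages) might look like the following; the tokenizer checkpoint and helper are illustrative assumptions, not the cited work's code.

```python
# Hypothetical sketch: finding anchor points (x1 == x2) between two sentences
# in different languages, i.e. subword pieces shared by both tokenizations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def anchor_points(sent_a, sent_b):
    """Return the subword pieces that occur in both tokenized sentences."""
    pieces_a = set(tokenizer.tokenize(sent_a))
    pieces_b = set(tokenizer.tokenize(sent_b))
    return pieces_a & pieces_b   # shared subwords, digits, punctuation

# Digits, punctuation, and shared word pieces typically survive in both languages.
shared = anchor_points("BERT was released in 2018.",
                       "BERT wurde 2018 veröffentlicht.")
```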
“…In recent years, several pre-trained multilingual language models have been proposed for zero-shot cross-lingual transfer, including multilingual BERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020a; Goyal et al., 2021). Many studies focus on the rationales that make zero-shot cross-lingual transfer work (K et al., 2020; Lauscher et al., 2020; Conneau et al., 2020b; Artetxe et al., 2020; Dufter and Schütze, 2020). Various tasks and datasets have been presented to facilitate zero-shot cross-lingual transfer learning (Conneau et al., 2018; Yang et al., 2019; Clark et al., 2020; Artetxe et al., 2020; Lewis et al., 2020).…”
Section: Related Work
confidence: 99%
“…In the meantime, the efficiency of cross-lingual transfer with recently released pretrained multilingual language models (Devlin et al., 2019; Conneau et al., 2020a) has boosted an active line of research that analyzes their representations to understand what favors the emergence of an interlingua. For instance, Pires et al. (2019), Dufter and Schütze (2020), and Karthikeyan et al. (2020) tried to decouple the effect of shared "anchors" from the rest of the model. Very recently, Muller et al. (2021) performed a more fine-grained analysis, examining representations at each layer of the model.…”
Section: Related Work
confidence: 99%