Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
DOI: 10.18653/v1/2020.emnlp-main.40
Don’t Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of Contextual Embeddings

Abstract: Multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning, where multilingual BERT is fine-tuned on one source language and evaluated on a different target language. However, published results for mBERT zero-shot accuracy vary as much as 17 points on the MLDoc classification task across four papers. We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reprod…

Cited by 20 publications (25 citation statements)
References 17 publications
“…With exactly the same training data, using different random seeds yields a 1–2 point accuracy difference in FS-XLT (Figure 1, top). A similar phenomenon has been observed when fine-tuning monolingual encoders (Dodge et al., 2020) and multilingual encoders with ZS-XLT (Keung et al., 2020a; Wu and Dredze, 2020b; Xia et al., 2020); we show that this observation also holds for FS-XLT. The key takeaway is that varying the buckets is a more severe problem.…”
Section: Target-adapting Results (supporting, confidence: 90%)
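The seed-variance point above can be made concrete with a small sketch. The `train_and_eval` function below is a hypothetical stand-in for fine-tuning a multilingual encoder with a given random seed (it only simulates the reported 1–2 point spread; it is not the cited papers' code):

```python
import random
import statistics

def train_and_eval(seed: int) -> float:
    """Hypothetical stand-in: fine-tune with `seed`, return target accuracy.
    Simulates a roughly 1-2 point accuracy spread across seeds."""
    rng = random.Random(seed)
    return 85.0 + rng.uniform(-1.0, 1.0)

seeds = [1, 2, 3, 4, 5]
accs = [train_and_eval(s) for s in seeds]

# Reporting mean and spread across seeds makes the instability visible,
# instead of quoting a single (possibly lucky) run.
mean = statistics.mean(accs)
spread = max(accs) - min(accs)
print(f"mean={mean:.2f} spread={spread:.2f}")
```

Reporting the spread across seeds, rather than a single number, is one way to make cross-paper comparisons less sensitive to this instability.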
“…For MLDoc, our results are comparable to Dong and de Melo (2019), Wu and Dredze (2019), and Eisenschlos et al. (2019). It is worth noting that reproducing the exact results is challenging, as suggested by Keung et al. (2020a). For MARC, our zero-shot results are worse than those of Keung et al. (2020b), who use the dev set of each target language for model selection, while we use EN dev, following the common true ZS-XLT setup.…”
Section: Source-training Results (mentioning, confidence: 64%)
“…To choose the final model, we use the scores on the English development data. We are aware that this was recently shown to be sub-optimal in some settings (Keung et al., 2020); however, there is no clear way to circumvent this in a pure zero-shot cross-lingual setup (i.e., without assuming any target-language, target-task annotated data).…”
Section: Methods (mentioning, confidence: 99%)
“…Following Keung et al. (2020), in all experiments the other hyper-parameters are tuned on each target language's dev set. We train all models for 10 epochs and choose the best model checkpoint using the target dev set.…”
Section: Implementation Details (mentioning, confidence: 99%)
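The two model-selection rules discussed in these excerpts (English dev vs. target-language dev) can be contrasted with a minimal sketch. The checkpoint scores below are illustrative numbers, not results from the paper:

```python
# Hypothetical per-epoch dev accuracies for one fine-tuning run.
checkpoints = [
    {"epoch": 1, "en_dev": 88.0, "target_dev": 70.5},
    {"epoch": 2, "en_dev": 90.5, "target_dev": 72.0},
    {"epoch": 3, "en_dev": 91.0, "target_dev": 71.0},  # best on EN dev
    {"epoch": 4, "en_dev": 90.0, "target_dev": 74.5},  # best on target dev
]

# True zero-shot setup: no target-language annotations are assumed,
# so the checkpoint is chosen by English dev accuracy alone.
by_en = max(checkpoints, key=lambda c: c["en_dev"])

# Oracle-style setup (used in some cited work): the target-language
# dev set picks the checkpoint, which is no longer strictly zero-shot.
by_target = max(checkpoints, key=lambda c: c["target_dev"])

print(by_en["epoch"], by_target["epoch"])
```

When the two rules select different checkpoints, as in this toy run (epoch 3 vs. epoch 4), the reported target-language accuracy differs accordingly, which is the reproducibility issue the paper highlights.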