Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.151
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Abstract: Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT syst…

Cited by 44 publications (57 citation statements)
References 42 publications
“…A variation on our HTER estimator model trained with the vector x = [h; s; r; h ∗ s; h ∗ r; |h − s|; |h − r|] as input to the feed-forward only succeeded in boosting segment-level performance in 8 of the 18 language pairs outlined in Section 5 below, and the average improvement in Kendall's Tau in those settings was +0.0009. As noted in Zhao et al. (2020), while cross-lingual pretrained models are adaptive to multiple languages, the feature space between languages is poorly aligned. On this basis we decided in favor of excluding the source embedding, on the intuition that the most important information comes from the reference embedding and that reducing the feature space would allow the model to focus more on relevant information.…”
Section: Estimator Model
confidence: 99%
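The combined feature vector described in the excerpt above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `estimator_features`, the use of NumPy, and the reading of the juxtaposed terms as element-wise products are assumptions.

```python
import numpy as np

def estimator_features(h, s, r):
    """Build x = [h; s; r; h*s; h*r; |h-s|; |h-r|] from hypothesis (h),
    source (s), and reference (r) sentence embeddings of equal dimension d.
    The result is a single vector of length 7*d, suitable as input to a
    feed-forward estimator."""
    h, s, r = (np.asarray(v, dtype=float) for v in (h, s, r))
    return np.concatenate([h, s, r, h * s, h * r, np.abs(h - s), np.abs(h - r)])

# Example with d = 4: the feature vector has length 7 * 4 = 28.
x = estimator_features(np.ones(4), np.zeros(4), np.full(4, 2.0))
```

Dropping the source-dependent pieces (s, h ∗ s, |h − s|), as the excerpt says the authors ultimately did, would shrink the input from 7d to 4d dimensions.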
“…Overall, the average performance for mBERT (0.16) is five times better than random guessing, but consistently lower than the performance for mFastText (0.46 on average). Overall, this shows that mBERT does not properly capture multilingual semantics, a finding echoed in other recent work (Zhao et al., 2020b). The apparent reason lies in its naive training process, which does not exploit cross-lingual signals but merely trains on the concatenation of all languages.…”
Section: Cross-lingual Semantics
confidence: 83%
“…K et al. (2020) show that lexical overlap plays no big role in cross-lingual transfer for mBERT, but the depth of the network does, with deeper models transferring better. Zhao et al. (2020b) find that mBERT lacks fine-grained cross-lingual text understanding and can be fooled by the adversarial, corrupted inputs produced by MT systems.…”
Section: Cross-lingual Representations
confidence: 97%
“…Our method is simple, interpretable, and produces scores closer to human judgements on an absolute scale, while enabling more fine-grained analysis, which can be useful for finding weak spots in the evaluated model. In future work, we would like to combine knowledge-based signals with unsupervised approaches like YiSi (Lo, 2019) and XMoverScore (Zhao et al., 2020) that use contextualized representations from cross-lingual LMs such as multilingual BERT (Devlin et al., 2019). As our method does not require reference translations, we would like to explore scaling it to much larger or domain-specific monolingual datasets.…”
Section: Discussion
confidence: 99%