Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019)
DOI: 10.18653/v1/d19-1328
Lost in Evaluation: Misleading Benchmarks for Bilingual Dictionary Induction

Abstract: The task of bilingual dictionary induction (BDI) is commonly used for intrinsic evaluation of cross-lingual word embeddings. The largest dataset for BDI was generated automatically, so its quality is dubious. We study the composition and quality of the test sets for five diverse languages from this dataset, with concerning findings: (1) a quarter of the data consists of proper nouns, which can hardly be indicative of BDI performance, and (2) there are pervasive gaps in the gold-standard targets. These issues a…

Cited by 20 publications (19 citation statements)
References 20 publications
“…Table 12 shows the result of Task 2 broken down based on the categorizations made by Kementchedjhieva et al (2019). In some languages, the pretokenization of MWEs improved the translation ac-…”
Section: E Experimental Results
Mentioning confidence: 99%
“…Some studies (Søgaard et al, 2018; Ormazabal et al, 2019) claim that the accuracy of cross-lingual alignments depends on the similarity of the word embedding spaces of different languages, and this similarity in turn depends on the similarity between the training corpora. Kementchedjhieva et al (2019), illustrating an issue related to the evaluation of CWEs, argue that proper nouns constitute a quarter of the MUSE dataset, rendering it not ideal for word translation.…”
Section: The Limitations Of CWEs
Mentioning confidence: 99%
“…Moreover, existing translation benchmarks have been shown to have several issues on their own. In particular, bilingual lexicon induction datasets have been reported to misrepresent morphological variations, overly focus on named entities and frequent words, and have pervasive gaps in the gold-standard targets (Czarnowska et al, 2019;Kementchedjhieva et al, 2019). More generally, most of these datasets are limited to relatively close languages and comparable corpora.…”
Section: Evaluation Practices
Mentioning confidence: 99%
“…Taking advantage of hubness clearly improves performance on the MUSE challenge, but why? Hopefully, the explanation is the one above (most words have relatively few translations), but it is also possible that hubness is taking advantage of flaws in the benchmark, such as gaps in MUSE: most words should have many more translations than those in MUSE (Kementchedjhieva, Hartmann, and Søgaard 2019). For example, the antonymy relationship <inexperienced, =, experienced> is a triple where h is inexperienced, r is =, and t is experienced. Heads and tails are typically represented as vectors, and relations are represented as rotation matrices.…”
Section: Background: Rotation Matrices, BLI, and KGC
Mentioning confidence: 99%
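The rotation-based representation mentioned in the excerpt above can be sketched in miniature: heads and tails are vectors, and a relation is a rotation carrying the head onto the tail. The 2-D embeddings and the choice of a 180° rotation for antonymy below are purely illustrative assumptions, not the setup of any cited paper.

```python
import math

def rotate(vec, theta):
    # Apply the 2-D rotation matrix R(theta) to vec, i.e. t_hat = R(theta) @ h.
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

# Toy embeddings (hypothetical): the head is a 2-D vector and the
# relation is parameterized by a single rotation angle.
head = (1.0, 0.0)          # stand-in for the embedding of "inexperienced"
antonymy_angle = math.pi   # assumption: antonymy modeled as a half-turn

tail = rotate(head, antonymy_angle)
# The rotated head approximates the tail embedding ("experienced"),
# here landing near (-1.0, 0.0).
```

In a real knowledge-graph-completion model the rotation parameters are learned so that R(r) applied to h lands close to t for observed triples; this sketch only shows the algebraic form of that scoring step.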