Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019)
DOI: 10.18653/v1/d19-1328
Lost in Evaluation: Misleading Benchmarks for Bilingual Dictionary Induction

Abstract: The task of bilingual dictionary induction (BDI) is commonly used for intrinsic evaluation of cross-lingual word embeddings. The largest dataset for BDI was generated automatically, so its quality is dubious. We study the composition and quality of the test sets for five diverse languages from this dataset, with concerning findings: (1) a quarter of the data consists of proper nouns, which can hardly be indicative of BDI performance, and (2) there are pervasive gaps in the gold-standard targets. These issues a…

Cited by 20 publications (19 citation statements)
References 20 publications
“…Table 12 shows the result of Task 2 broken down based on the categorizations made by Kementchedjhieva et al (2019). In some languages, the pretokenization of MWEs improved the translation ac-…”
Section: E Experimental Results
Mentioning confidence: 99%
“…Some studies (Søgaard et al, 2018; Ormazabal et al, 2019) claim that the accuracy of cross-lingual alignments depends on the similarity of the word embedding spaces of different languages, and this similarity in turn depends on the similarity between the training corpora. Kementchedjhieva et al (2019), illustrating an issue related to the evaluation of CWEs, argue that proper nouns constitute a quarter of the MUSE dataset, rendering it not ideal for word translation.…”
Section: The Limitations Of CWEs
Mentioning confidence: 99%
“…Moreover, existing translation benchmarks have been shown to have several issues on their own. In particular, bilingual lexicon induction datasets have been reported to misrepresent morphological variations, overly focus on named entities and frequent words, and have pervasive gaps in the gold-standard targets (Czarnowska et al, 2019;Kementchedjhieva et al, 2019). More generally, most of these datasets are limited to relatively close languages and comparable corpora.…”
Section: Evaluation Practices
Mentioning confidence: 99%
“…Taking advantage of hubness clearly improves performance on the MUSE challenge, but why? Hopefully, the explanation is the one above (most words have relatively few translations), but it is also possible that hubness is taking advantage of flaws in the benchmark, such as gaps in MUSE: most words should have many more translations than those in MUSE (Kementchedjhieva, Hartmann, and Søgaard 2019). For example, the antonymy relationship <inexperienced, =, experienced> is a triple where h is inexperienced, r is =, and t is experienced. Heads and tails are typically represented as vectors, and relations are represented as rotation matrices.…”
Section: Background: Rotation Matrices, BLI, and KGC
Mentioning confidence: 99%
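The rotation-based representation mentioned in the excerpt above can be sketched in miniature: heads and tails are vectors, and a relation is a rotation carrying the head onto the tail. The 2-D embeddings and the choice of a 180° rotation for antonymy below are purely illustrative assumptions, not the setup of any cited paper.

```python
import math

def rotate(vec, theta):
    # Apply the 2-D rotation matrix R(theta) to vec, i.e. t_hat = R(theta) @ h.
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

# Toy embeddings (hypothetical): the head is a 2-D vector and the
# relation is parameterized by a single rotation angle.
head = (1.0, 0.0)          # stand-in for the embedding of "inexperienced"
antonymy_angle = math.pi   # assumption: antonymy modeled as a half-turn

tail = rotate(head, antonymy_angle)
# The rotated head approximates the tail embedding ("experienced"),
# here landing near (-1.0, 0.0).
```

In a real knowledge-graph-completion model the rotation parameters are learned so that R(r) applied to h lands close to t for observed triples; this sketch only shows the algebraic form of that scoring step.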