Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1070
How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions

Abstract: Cross-lingual word embeddings (CLEs) enable multilingual modeling of meaning and facilitate cross-lingual transfer of NLP models. Despite their ubiquitous usage in downstream tasks, recent increasingly popular projection-based CLE models are almost exclusively evaluated on a single task only: bilingual lexicon induction (BLI). Even BLI evaluations vary greatly, hindering our ability to correctly interpret performance and properties of different CLE models. In this work, we make the first step towards a comprehe…

Cited by 141 publications (147 citation statements)
References 55 publications
“…For en-id, we used English (100M lines) and Indonesian (77M lines) Common Crawl corpora. We then mapped the word embeddings into a BWE space using VECMAP, one of the best and most robust methods for unsupervised mapping (Glavaš et al., 2019). The resulting BWE were used as baselines in our evaluation tasks and also to bootstrap our USMT system.…”
Section: Settings for Training BWE
confidence: 99%
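The projection step quoted above (mapping monolingual embeddings into a shared bilingual space) can be illustrated, in its simplest supervised form, as an orthogonal Procrustes problem. This is a hedged sketch of that core idea, not VecMap's actual (iterative, optionally unsupervised) procedure; all names here are illustrative.

```python
import numpy as np

def procrustes_map(X, Y):
    """Learn an orthogonal matrix W minimizing ||XW - Y||_F.

    X and Y are aligned source/target embedding matrices (rows are
    translation pairs). The SVD of the cross-covariance X^T Y gives
    the optimal orthogonal map.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: if Y is an exact rotation of X, the map recovers it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y = X @ Q
W = procrustes_map(X, Y)
print(np.allclose(X @ W, Y))  # True
```

In practice the seed dictionary rows of X and Y come from a bilingual lexicon, and self-learning methods iterate between inducing such a dictionary and re-solving this mapping.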
“…Bilingual lexicon induction (BLI) is by far the most popular evaluation task for BWE used by previous work, in spite of its limits (Glavaš et al., 2019). In contrast to previous work, we used much larger test sets for each language pair.…”
Section: Task 1: Bilingual Lexicon Induction
confidence: 99%
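The BLI evaluation this excerpt refers to amounts to nearest-neighbor retrieval over a test lexicon: for each source word, retrieve the closest target word in the shared space and score precision@1 against the gold translation. A minimal sketch under simplifying assumptions (single gold translation per word, plain cosine retrieval; real evaluations often use CSLS and allow multiple references):

```python
import numpy as np

def bli_precision_at_1(src_emb, tgt_emb, test_pairs):
    """Precision@1 for bilingual lexicon induction.

    src_emb, tgt_emb: embedding matrices in a shared bilingual space.
    test_pairs: list of (source_index, gold_target_index) tuples.
    """
    # L2-normalize so the dot product equals cosine similarity
    S = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for s_idx, gold_t_idx in test_pairs:
        pred = int(np.argmax(S[s_idx] @ T.T))  # nearest target word
        hits += (pred == gold_t_idx)
    return hits / len(test_pairs)

# Sanity check: identical spaces give perfect precision on identity pairs.
rng = np.random.default_rng(1)
E = rng.normal(size=(20, 8))
pairs = [(i, i) for i in range(20)]
print(bli_precision_at_1(E, E, pairs))  # 1.0
```

The paper's point is that scores from this task alone, with varying test sets and retrieval choices, do not transfer reliably to downstream performance.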
“…Furthermore, unlike Caliskan et al. (2017), we test whether biases depend on the selection of the similarity metric. Finally, given the ubiquitous adoption of cross-lingual embeddings (Ruder et al., 2017; Glavaš et al., 2019), we investigate biases in a variety of bilingual embedding spaces.…”
Section: Methods
confidence: 99%
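The metric-dependence question raised in this excerpt can be made concrete with a WEAT-style association score computed under two different similarity measures. This is an illustrative sketch of the general idea, not the cited authors' exact procedure; the function and variable names are assumptions.

```python
import numpy as np

def association(w, A, B, metric="cosine"):
    """WEAT-style association of word vector w with attribute sets A, B:
    mean similarity to A minus mean similarity to B. The metric switch
    lets one check whether the bias score depends on the similarity
    measure chosen."""
    def sim(u, v):
        if metric == "cosine":
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        # alternative: negative Euclidean distance as a similarity
        return -float(np.linalg.norm(u - v))
    return np.mean([sim(w, a) for a in A]) - np.mean([sim(w, b) for b in B])

# Toy example: w points in A's direction, so both metrics agree on sign.
w = np.array([1.0, 0.0])
A = [np.array([1.0, 0.1]), np.array([0.9, -0.1])]
B = [np.array([-1.0, 0.0]), np.array([-0.8, 0.2])]
print(association(w, A, B) > 0)  # True
```

In degenerate cases the two metrics can disagree in magnitude or even sign, which is exactly why checking metric sensitivity matters.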
“…We use the word2vec skip-gram model [16] with the following parameters: embedding dimensionality of 300, a window size of 10 words, a minimal corpus frequency of 10, negative sampling with 10 samples, no downsampling, and 20 iterations over the corpus. Then we use the vecmap framework [17, 18] to learn a transformation matrix that maps representations in one language to the representations of the …”
Section: Cross-lingual Embeddings
confidence: 99%
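For readers reproducing a setup like the one quoted above, the listed hyperparameters map onto gensim's `Word2Vec` keyword arguments roughly as follows (keyword names per gensim 4.x; this is an assumed mapping, not the cited authors' actual script):

```python
# Quoted hyperparameters expressed as gensim 4.x Word2Vec keyword arguments.
skipgram_params = {
    "vector_size": 300,  # embedding dimensionality of 300
    "window": 10,        # context window of 10 words
    "min_count": 10,     # minimal corpus frequency of 10
    "sg": 1,             # skip-gram (as opposed to CBOW)
    "negative": 10,      # negative sampling with 10 samples
    "sample": 0,         # no downsampling of frequent words
    "epochs": 20,        # 20 iterations over the corpus
}

# Usage (requires gensim installed):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, **skipgram_params)
print(skipgram_params["vector_size"], skipgram_params["epochs"])
```

The resulting monolingual vectors would then be fed to vecmap to learn the cross-lingual transformation.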