Cross-lingual word embeddings have drawn the attention of researchers in the field of natural language processing (NLP). Although existing methods yield satisfactory results for resource-rich languages and the languages related to them, some researchers have pointed out that the same does not hold for low-resource and distant languages. This article presents a systematic literature review on quantifying the proximity between independently trained monolingual word embedding spaces. The search was carried out in the broader context of inducing bilingual lexicons from cross-lingual word embeddings, especially for low-resource languages, and the returned articles were then classified. We report on the methods proposed to provide better representations for low-resource and distant languages in the cross-lingual word embedding space.
The suboptimal performance of cross-lingual word embeddings for distant and low-resource languages calls into question the isomorphism assumption integral to mapping-based methods of obtaining such embeddings. This paper investigates the comparative impact of typological relationship and corpus size on the isomorphism between monolingual embedding spaces. To that end, two clustering algorithms were applied to three sets of pairwise degrees of isomorphism. A further goal is to determine which combination of isomorphism measure and clustering algorithm best captures the typological relationships among the chosen set of languages. Of the three measures investigated, Relational Similarity seemed to best capture the typological information of the languages encoded in their respective embedding spaces. The resulting language clusters can help identify the related higher-resource languages of a low-resource language without any pre-existing knowledge of the real-world linguistic relationships within a group of languages. Including such related languages in the cross-lingual embedding space can, in turn, help improve the performance of the low-resource languages in that space.
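To make the pipeline described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that Relational Similarity is computed in the commonly used way, as the Pearson correlation between the cosine similarities of word pairs in one space and the cosine similarities of their translations in the other; the abstract itself does not define the measure. The abstract also does not name the two clustering algorithms, so average-linkage agglomerative clustering is used here purely as a stand-in. All function names (`relational_similarity`, `average_linkage_clusters`) and the input layout are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def relational_similarity(src_vectors, tgt_vectors):
    """Assumed definition: src_vectors[i] and tgt_vectors[i] embed a
    translation pair; correlate within-space similarities across spaces."""
    pairs = list(combinations(range(len(src_vectors)), 2))
    src_sims = [cosine(src_vectors[i], src_vectors[j]) for i, j in pairs]
    tgt_sims = [cosine(tgt_vectors[i], tgt_vectors[j]) for i, j in pairs]
    return pearson(src_sims, tgt_sims)


def average_linkage_clusters(names, similarity, n_clusters):
    """Stand-in clustering: repeatedly merge the two clusters with the
    highest average pairwise similarity until n_clusters remain.
    `similarity` maps frozenset({name_a, name_b}) to a score."""
    clusters = [[name] for name in names]

    def link(a, b):
        total = sum(similarity[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

In this sketch, one would compute `relational_similarity` for every language pair over a shared list of translation pairs, store the scores in a dictionary keyed by `frozenset` of the two language names, and pass that dictionary to `average_linkage_clusters`. A rotated copy of a space scores close to 1 under this definition, because rotation preserves all cosine similarities.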