2018
DOI: 10.1007/978-3-319-76941-7_30

Cross-Lingual Document Retrieval Using Regularized Wasserstein Distance

Abstract: Many information retrieval algorithms rely on the notion of a good distance that allows efficient comparison of objects of different nature. Recently, a promising new metric called Word Mover's Distance was proposed to measure the divergence between text passages. In this paper, we demonstrate that this metric can be extended to incorporate term-weighting schemes and to provide more accurate and computationally efficient matching between documents using entropic regularization. We evaluate the benefits of both ext…
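The entropic regularization mentioned in the abstract is commonly computed with Sinkhorn iterations. The following is a minimal sketch of a regularized Wasserstein distance between two bag-of-words histograms, assuming a precomputed cost matrix of embedding distances; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def sinkhorn_distance(a, b, C, reg=0.1, n_iter=200):
    """Entropic-regularized Wasserstein distance between two histograms.

    a, b : term-weight histograms of the two documents (each sums to 1)
    C    : cost matrix, e.g. Euclidean distances between word embeddings
    reg  : regularization strength; smaller values approach exact OT
    """
    K = np.exp(-C / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):             # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # approximate transport plan
    return float(np.sum(P * C))
```

Compared with the exact linear program behind WMD, each iteration is only a pair of matrix-vector products, which is what makes the regularized variant attractive for retrieval over many documents.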

Cited by 7 publications (7 citation statements). References 18 publications.
“…Textual metrics that consider specific qualities in the system outputs, like complexity and diversity, are also used to evaluate NLG systems (Dusek et al., 2019; Hashimoto et al., 2019; Sagarkar et al., 2018; Purdy et al., 2018). Word mover's distance has recently been used for NLP tasks like learning word embeddings (Zhang et al., 2017; Wu et al., 2018), textual entailment (Sulea, 2017), document similarity and classification (Kusner et al., 2015; Huang et al., 2016; Atasu et al., 2017), image captioning (Kilickaya et al., 2017), document retrieval (Balikas et al., 2018), clustering for semantic word-rank (Zhang and Wang, 2018), and as additional loss for text generation that measures the optimal transport between the generated hypothesis and reference text (Chen et al., 2019). We investigate WMD for multi-sentence text evaluation and generation and introduce sentence embedding-based metrics.…”
Section: Related Work
confidence: 99%
“…Since the original WMD is computationally expensive, we approximate the distance by using the Regularized Wasserstein distance proposed by [41] and only keep the five closest articles. The five articles with the least distance are then selected for computation with the original WMD.…”
Section: Content Analysis: Semantic Distance Analysis
confidence: 99%
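The two-stage scheme this citing paper describes (screen all candidates with the cheap regularized distance, then re-score only the closest few with the original WMD) can be sketched as follows. This is a sketch under assumptions: documents are histograms over a shared vocabulary with one fixed cost matrix, and a weakly regularized Sinkhorn pass stands in for the exact WMD re-ranking step; all names are illustrative:

```python
import numpy as np

def sinkhorn(a, b, C, reg, n_iter=200):
    # entropic-regularized optimal-transport cost (cheap WMD surrogate)
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return float(np.sum((u[:, None] * K * v[None, :]) * C))

def two_stage_retrieval(q, docs, C, k=5):
    """Rank `docs` against the query histogram `q`.

    Stage 1 screens every document with strong regularization (fast but
    coarse); stage 2 re-ranks only the k closest with weak regularization,
    standing in here for the exact WMD used by the citing paper.
    """
    coarse = sorted((sinkhorn(q, d, C, reg=1.0), i)
                    for i, d in enumerate(docs))
    shortlist = [i for _, i in coarse[:k]]
    return sorted((sinkhorn(q, docs[i], C, reg=0.05), i)
                  for i in shortlist)
```

The design point is that the expensive, near-exact computation runs on a constant-size shortlist (five articles in the citing paper) rather than the whole corpus.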
“…However, these methods have been solely applied in the monolingual space. Other methods have been proposed to leverage EMD for cross-lingual document retrieval [4]; however, these methods treat individual words as the base semantic unit for comparison. The large number of tokens present in web documents, coupled with the cubic complexity of WMD, makes these approaches intractable for large-scale web alignment.…”
Section: Related Work
confidence: 99%