Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1306
Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations

Abstract: In this paper, we propose to boost low-resource cross-lingual document retrieval performance with deep bilingual query-document representations. We match queries and documents in both source and target languages with four components, each of which is implemented as a term interaction-based deep neural network with cross-lingual word embeddings as input. By including query likelihood scores as extra features, our model effectively learns to rerank the retrieved documents by using a small number of relevance labe…
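The reranking idea in the abstract — combining scores from several matching components with query likelihood features and sorting candidates by the combined score — can be sketched as follows. This is a minimal illustration, not the paper's architecture: the component scores are stubbed inputs, and the weights (which the paper learns from a small number of relevance labels) are hypothetical.

```python
# Minimal sketch of score-combination reranking: each candidate document
# carries scores from four (stubbed) bilingual matching components plus a
# query-likelihood feature; a linear combination yields one ranking score.

def rerank(candidates, weights, bias=0.0):
    """candidates: list of (doc_id, [s1, s2, s3, s4, ql_score]) pairs.
    Returns doc_ids sorted by the combined score, best first."""
    def combined(features):
        return sum(w * f for w, f in zip(weights, features)) + bias

    return [doc for doc, _ in sorted(candidates,
                                     key=lambda c: combined(c[1]),
                                     reverse=True)]

# Hypothetical example; in the paper these weights would be learned
# from relevance labels rather than set by hand.
weights = [0.3, 0.2, 0.2, 0.1, 0.2]
candidates = [
    ("doc_a", [0.1, 0.2, 0.1, 0.3, 0.2]),
    ("doc_b", [0.9, 0.8, 0.7, 0.6, 0.5]),
]
print(rerank(candidates, weights))  # doc_b outranks doc_a
```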

Cited by 19 publications (16 citation statements); references 38 publications (56 reference statements).
“…This allows the use of different embeddings inside the same model and helps when two languages do not share the same space inside a single model (Cao et al., 2020). For example, Zhang et al. (2019b) used bilingual representations by creating cross-lingual word embeddings from a small set of parallel sentences between the high-resource language English and three low-resource African languages, Swahili, Tagalog, and Somali, to improve document retrieval performance for the African languages.…”
Section: Multilingual Language Models
confidence: 99%
“…POSIT-DRMM (Zhang et al. 2019), a recently proposed cross-lingual document retrieval model designed to address the low-resource issue in CLIR. This model incorporates bilingual representations to capture and aggregate matching signals between an input query in the source language and a document in the target language.…”
Section: Directly CLIR Models
confidence: 99%
“…To evaluate model performance, we follow the conventional settings of related work (Wu et al. 2017; Zhou et al. 2018; Zhang et al. 2019). Specifically, we first calculate the matching scores between a product attribute set and product description candidates, and then rank the matching scores of all candidates to calculate the following automatic metrics: mean reciprocal rank (MRR) (Voorhees et al. 1999) and recall at position k in n candidates (Rn@k).…”
Section: Evaluation Metrics
confidence: 99%
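The two metrics named in the quote above have standard definitions, sketched here for reference. Rankings are lists of candidate ids ordered best-first, and `relevant` holds the gold ids per query; the variable names and toy data are illustrative, not from any cited paper.

```python
# Mean reciprocal rank (MRR) and recall at position k (Rn@k) over a set
# of ranked candidate lists, one per query.

def mrr(rankings, relevant):
    """Mean of 1/rank of the first relevant candidate (0 if none appears)."""
    total = 0.0
    for ranking, gold in zip(rankings, relevant):
        for rank, cand in enumerate(ranking, start=1):
            if cand in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)

def recall_at_k(rankings, relevant, k):
    """Fraction of queries whose top-k list contains a relevant candidate."""
    hits = sum(1 for ranking, gold in zip(rankings, relevant)
               if any(c in gold for c in ranking[:k]))
    return hits / len(rankings)

# Toy example: two queries, three ranked candidates each.
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d6"}]
print(mrr(rankings, relevant))             # (1/2 + 1/3) / 2
print(recall_at_k(rankings, relevant, 2))  # 0.5
```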
“…As machine translation has increased the usefulness of CLIR, recently introduced deep neural methods have improved ranking quality [4,29,43,45,47]. By and large, these techniques appear to provide a large jump in the quality of CLIR output.…”
Section: Introduction
confidence: 99%