Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.168
GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

Abstract: Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data, which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers severely degrades under a domain shift. This limits the usage of dense retrieval approaches to only a few domains with large training datasets. In this paper, we propose the novel unsupervised domain adaptation method Generative Pseudo Labeling (GPL) …
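For readers new to GPL, the sketch below outlines the three stages the abstract alludes to: generating synthetic queries for target-domain passages with a seq2seq model, mining hard negatives with a retriever, and pseudo-labeling the resulting triples with a cross-encoder. The checkpoint names are illustrative public models, not necessarily the exact configuration used in the paper.

```python
# Sketch of the GPL pipeline: query generation -> negative mining -> pseudo-labeling.
# Model names are illustrative public checkpoints, not the paper's exact setup.
from transformers import AutoTokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer, CrossEncoder, util

passages = [
    "Dense retrieval maps queries and documents into a shared vector space.",
    "BM25 is a classic lexical ranking function based on term frequencies.",
]

# 1) Generate a synthetic query for each target-domain passage with a seq2seq model.
qgen_tok = AutoTokenizer.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
qgen = T5ForConditionalGeneration.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
inputs = qgen_tok(passages, padding=True, truncation=True, return_tensors="pt")
outputs = qgen.generate(**inputs, max_length=64, do_sample=True, top_p=0.95)
queries = qgen_tok.batch_decode(outputs, skip_special_tokens=True)

# 2) Mine hard negatives for each (query, positive) pair with a dense retriever.
retriever = SentenceTransformer("msmarco-distilbert-base-v3")
p_emb = retriever.encode(passages, convert_to_tensor=True)
q_emb = retriever.encode(queries, convert_to_tensor=True)
hits = util.semantic_search(q_emb, p_emb, top_k=2)

# 3) Pseudo-label each (query, positive, negative) triple with a cross-encoder:
#    the training signal is the score margin CE(q, pos) - CE(q, neg).
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
for qi, query in enumerate(queries):
    pos = passages[qi]
    # take the top-ranked passage that is not the positive as a hard negative
    neg_idx = next(h["corpus_id"] for h in hits[qi] if h["corpus_id"] != qi)
    neg = passages[neg_idx]
    margin = (cross_encoder.predict([(query, pos)])[0]
              - cross_encoder.predict([(query, neg)])[0])
    print(f"query={query!r}  margin={margin:.3f}")
```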

Cited by 39 publications (33 citation statements) · References 11 publications
“…In NLP, Wang et al. [171] used Generative Pseudo Labeling (GPL) for query–passage pair extraction: they retrieved positive passages from labeled data and applied that model to retrieve negative passages in the target data. Thereafter, they used the Margin-MSE loss, which helped the cross-encoder to soft-label query–passage pairs effectively.…”
Section: Pseudo-semi-supervised Domain Adaptation (mentioning)
confidence: 99%
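The Margin-MSE objective this statement refers to can be written down compactly: the dense retriever (the student) is trained so that its score margin between a positive and a negative passage matches the margin assigned by the frozen cross-encoder (the teacher). A minimal PyTorch sketch, with illustrative tensor shapes and names:

```python
# Minimal sketch of the Margin-MSE objective used for pseudo-labeling:
# the bi-encoder student regresses the cross-encoder teacher's score margin.
import torch
import torch.nn.functional as F

def margin_mse_loss(student_q, student_pos, student_neg, teacher_pos, teacher_neg):
    """student_*: (batch, dim) embeddings from the dense retriever being trained.
    teacher_*: (batch,) relevance scores from a frozen cross-encoder."""
    # student margin: dot-product score difference between positive and negative
    student_margin = (student_q * student_pos).sum(-1) - (student_q * student_neg).sum(-1)
    # teacher margin: the soft label distilled from the cross-encoder
    teacher_margin = teacher_pos - teacher_neg
    return F.mse_loss(student_margin, teacher_margin)

# toy usage with random embeddings and scores
q = torch.randn(4, 768)
pos, neg = torch.randn(4, 768), torch.randn(4, 768)
t_pos, t_neg = torch.randn(4), torch.randn(4)
print(margin_mse_loss(q, pos, neg, t_pos, t_neg))
```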
“…Creating datasets is very expensive but often necessary for domain adaptation. A growing trend is the generation of synthetic QA datasets from models [137] or from unstructured text using techniques such as ICT [138], GPL [139], GenQ [140], Promptagator [141], and COCO-DR [142]. Other techniques, such as natural language augmentation [143], aim to enrich existing datasets for more robust training through transformation and data filtering.…”
Section: Big Bench Datasets For … (mentioning)
confidence: 99%
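Of the techniques this statement lists, ICT (the Inverse Cloze Task) is the simplest to illustrate, since it needs no trained generator: a sentence drawn from a passage acts as a pseudo-query and the remaining sentences act as its pseudo-relevant document. A minimal sketch; the function name and data are illustrative:

```python
# Sketch of the Inverse Cloze Task (ICT): a sentence sampled from a passage
# serves as a pseudo-query, the rest of the passage as its pseudo-positive.
import random

def ict_pair(passage_sentences, rng=random):
    """Split a passage (list of sentences) into a (pseudo_query, pseudo_passage) pair."""
    i = rng.randrange(len(passage_sentences))
    pseudo_query = passage_sentences[i]
    pseudo_passage = " ".join(s for j, s in enumerate(passage_sentences) if j != i)
    return pseudo_query, pseudo_passage

sentences = [
    "Dense retrievers degrade under domain shift.",
    "GPL adapts them with synthetic queries and pseudo labels.",
    "No labeled target-domain data is needed.",
]
print(ict_pair(sentences))
```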
“…In addition to prompt-based generation of training data, there are multiple proposals for self-supervised adaptation of out-of-domain models using generative pseudo-labeling [22,38,51]. To this end, questions or queries are generated using a pretrained seq2seq model (though an LLM can be used as well), and negative examples are mined using either BM25 or an out-of-domain retriever or ranker.…”
Section: Related Work (mentioning)
confidence: 99%
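As a concrete instance of the BM25-based negative mining this statement mentions, the sketch below scores a small corpus against a generated query and keeps the top non-positive hits as hard negatives. It assumes the third-party rank_bm25 package; the corpus, query, and positive index are illustrative:

```python
# Sketch of BM25-based hard-negative mining: for each generated query, the
# top BM25 hits that are not the known positive are treated as hard negatives.
from rank_bm25 import BM25Okapi

corpus = [
    "Dense retrieval maps text into a shared embedding space.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Domain adaptation tunes a model to a new target distribution.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query, positive_idx = "how does bm25 rank documents", 1
scores = bm25.get_scores(query.lower().split())
# sort by BM25 score, skip the positive, keep the rest as hard negatives
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
hard_negatives = [corpus[i] for i in ranked if i != positive_idx][:2]
print(hard_negatives)
```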