Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.203

UDALM: Unsupervised Domain Adaptation through Language Modeling

Abstract: In this work we explore Unsupervised Domain Adaptation (UDA) of pretrained language models for downstream tasks. We introduce UDALM, a fine-tuning procedure using a mixed classification and Masked Language Model loss that can adapt to the target domain distribution in a robust and sample-efficient manner. Our experiments show that the performance of models trained with the mixed loss scales with the amount of available target data, and that the mixed loss can be effectively used as a stopping criterion during UDA training…
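The core idea in the abstract is a single encoder trained with a weighted sum of a supervised classification loss (on labeled source-domain data) and a masked-language-model loss (on unlabeled target-domain data). Below is a minimal sketch of such a mixed objective, assuming a BERT-style encoder in PyTorch; the `MixedLossModel` class, the weight `alpha`, and the batch layout are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a mixed classification + masked-language-modeling loss on a
# shared BERT encoder, in the spirit of the abstract. The loss weight `alpha`
# and all hyperparameters are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn
from transformers import BertModel

class MixedLossModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.clf_head = nn.Linear(hidden, num_labels)                      # task head
        self.mlm_head = nn.Linear(hidden, self.encoder.config.vocab_size)  # MLM head

    def forward(self, src_batch, tgt_batch, alpha=0.5):
        # Classification loss on labeled source-domain examples.
        src_out = self.encoder(input_ids=src_batch["input_ids"],
                               attention_mask=src_batch["attention_mask"])
        clf_logits = self.clf_head(src_out.last_hidden_state[:, 0])  # [CLS] token
        clf_loss = nn.functional.cross_entropy(clf_logits, src_batch["labels"])

        # Masked-LM loss on unlabeled target-domain examples
        # (labels are -100 everywhere except at masked positions).
        tgt_out = self.encoder(input_ids=tgt_batch["input_ids"],
                               attention_mask=tgt_batch["attention_mask"])
        mlm_logits = self.mlm_head(tgt_out.last_hidden_state)
        mlm_loss = nn.functional.cross_entropy(
            mlm_logits.view(-1, mlm_logits.size(-1)),
            tgt_batch["mlm_labels"].view(-1),
            ignore_index=-100,
        )

        # Mixed objective: weighted sum of the two losses.
        return alpha * clf_loss + (1 - alpha) * mlm_loss
```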

Cited by 28 publications (33 citation statements)
References 44 publications
“…Zero-Shot Models: We apply supervised training on MS MARCO or PAQ and evaluate the trained retrievers on the target datasets. Previous Domain Adaptation Methods: We include two previous unsupervised domain adaptation methods, UDALM (Karouzos et al., 2021) and MoDIR. We follow Thakur et al. (2021b) to train QGen models with the default setting. Cosine similarity is used and the models are fine-tuned for 1 epoch with MNRL.…” (Footnote 12: https://github.com/UKPLab/beir)
Section: Baselines (mentioning)
confidence: 99%
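The quoted baseline setup (cosine similarity, one epoch of MNRL) corresponds to training with MultipleNegativesRankingLoss in the sentence-transformers library. A rough sketch under that assumption follows; the checkpoint name and the toy query–passage pairs are placeholders, not the cited paper's actual training data.

```python
# Sketch of the fine-tuning recipe the statement describes: cosine similarity
# with MultipleNegativesRankingLoss (MNRL) for one epoch. The model name and
# the toy training pairs below are placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("distilbert-base-uncased")  # placeholder checkpoint

train_examples = [
    InputExample(texts=["generated query 1", "matching passage 1"]),
    InputExample(texts=["generated query 2", "matching passage 2"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MNRL treats the other passages in the batch as negatives; its default
# scoring function is (scaled) cosine similarity.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1)
```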
“…The evaluation results on the original BioASQ and TREC-COVID are available in Appendix C. Evaluation is done using nDCG@10. Previous Domain Adaptation Methods: We include two previous unsupervised domain adaptation methods, UDALM (Karouzos et al., 2021) and MoDIR. UDALM uses the default setting in the original paper, where 15% of the tokens in a text are sampled to be masked and need to be predicted.…”
Section: Discussion (mentioning)
confidence: 99%
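The "15% of tokens are masked" setting quoted above is the standard BERT masking rate; one way to reproduce it is with the Hugging Face MLM data collator, as sketched below. The tokenizer choice and the example sentence are illustrative assumptions.

```python
# Sketch of the 15% token-masking setting the statement refers to, using the
# Hugging Face MLM collator; the example sentence is just an illustration.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15% of tokens are sampled for masking
)

encoded = tokenizer(["Unlabeled target-domain text for masked language modeling."],
                    return_tensors="pt")
batch = collator([{"input_ids": encoded["input_ids"][0]}])
# batch["labels"] is -100 everywhere except at the masked positions
# the model is asked to predict.
print(batch["input_ids"], batch["labels"])
```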
“…MoDIR trains models by generating domain-invariant representations to attack a domain classifier. However, as argued in Karouzos et al. (2021), DAT trains models by minimizing the distance between representations from different domains, and such a learning objective can result in a bad embedding space and unstable performance. For sentiment classification, Karouzos et al. (2021) propose UDALM, which is based on multiple stages of training.…”
Section: Related Work (mentioning)
confidence: 99%
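For context on the quoted argument, a generic Domain Adversarial Training (DAT) setup pushes an encoder toward domain-invariant features by reversing the gradients that flow back from a domain classifier. The sketch below is a textbook-style gradient-reversal layer in PyTorch, not the exact MoDIR or DANN implementation.

```python
# Illustrative gradient-reversal sketch of the DAT idea contrasted with UDALM:
# the encoder is trained to fool a domain classifier (source vs. target).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    def __init__(self, hidden=768, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, features):
        reversed_features = GradReverse.apply(features, self.lambd)
        return self.head(reversed_features)  # predicts source vs. target domain
```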
“…While we use lessons about careful design choices, our basic models use standard fine-tuning for simplicity. Efforts on the data side have focused on intermediate fine-tuning, either by using unlabeled target-domain data (Karouzos et al., 2021; Gururangan et al., 2020) or via labeled data from other tasks (Phang et al., 2018; Aghajanyan et al., 2021; Vu et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%