Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1173
Multi-View Domain Adapted Sentence Embeddings for Low-Resource Unsupervised Duplicate Question Detection

Abstract: We address the problem of Duplicate Question Detection (DQD) in low-resource, domain-specific Community Question Answering forums. Our multi-view framework MV-DASE combines an ensemble of sentence encoders via Generalized Canonical Correlation Analysis, using unlabeled data only. In our experiments, the ensemble includes generic and domain-specific averaged word embeddings, domain-finetuned BERT and the Universal Sentence Encoder. We evaluate MV-DASE on the CQADupStack corpus and on additional low-resource Stack…
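The abstract describes combining multiple sentence-encoder "views" via Generalized Canonical Correlation Analysis. A minimal MAXVAR-style GCCA sketch is below; it is an illustration of the general technique, not the authors' implementation, and the function name, regularizer value, and random inputs are all illustrative assumptions:

```python
import numpy as np

def gcca_shared_embeddings(views, k=2, reg=1e-3):
    """MAXVAR-style GCCA sketch: given a list of views (each an n x d_j
    matrix of n sentence embeddings from one encoder), return a shared
    n x k representation maximally correlated with all views."""
    n = views[0].shape[0]
    M = np.zeros((n, n))
    for X in views:
        Xc = X - X.mean(axis=0)                    # center each view
        C = Xc.T @ Xc + reg * np.eye(X.shape[1])   # regularized covariance
        M += Xc @ np.linalg.solve(C, Xc.T)         # projection onto view's column space
    # top-k eigenvectors of the summed projections give the shared coordinates
    vals, vecs = np.linalg.eigh(M)                 # eigh: ascending eigenvalues
    G = vecs[:, ::-1][:, :k]                       # take the k largest
    return G

# Usage sketch: two hypothetical encoder outputs for the same 20 sentences
view_a = np.random.RandomState(0).randn(20, 5)
view_b = np.random.RandomState(1).randn(20, 7)
G = gcca_shared_embeddings([view_a, view_b], k=3)  # shared 20 x 3 embedding
```

In this formulation each row of `G` is a fused sentence embedding that can then be compared (e.g., by cosine similarity) for duplicate detection.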

Cited by 17 publications (13 citation statements) | References 45 publications
“…Current models seemingly match similar keywords or phrases of the questions and answers, often without truly understanding them in context. (Rücklé et al, 2019b), ‡ is the MICRON model (Han et al, 2019), is the BERT model in (Ma et al, 2019), and is MV-DASE (Poerner and Schütze, 2019). Table 5: A mistake of MultiCQA RBa-lg (zero-shot transfer) on AskUbuntu.…”
Section: Discussion
“…Shah et al (2018) use adversarial domain adaptation for duplicate question detection. Poerner and Schütze (2019) adapt the combination of different sentence embeddings to individual target domains. Rücklé et al (2019b) use weakly supervised training, self-supervised training methods, and question generation.…”
Section: Related Work
“…Semantic textual similarity (STS) measures the degree of semantic equivalence between two text snippets, based on a graded numerical value, with applications including question answering (Yadav et al, 2020), duplicate detection (Poerner and Schütze, 2019), and entity linking (Zhou et al, 2020).…”
Section: Introduction