2020
DOI: 10.1007/978-3-030-45442-5_21
ANTIQUE: A Non-factoid Question Answering Benchmark

Abstract: Considering the widespread use of mobile and voice search, answer passage retrieval for non-factoid questions plays a critical role in modern information retrieval systems. Despite the importance of the task, the community still faces a significant lack of large-scale non-factoid question answering collections with real questions and comprehensive relevance judgments. In this paper, we develop and release a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset, c…

Cited by 47 publications (51 citation statements). References 14 publications (27 reference statements).
“…The predominant method for text matching tasks such as non-factoid answer selection and question similarity is to train a neural architecture on a large quantity of labeled in-domain data. This includes CNN and LSTM models with attention (Wang et al., 2016; Rücklé and Gurevych, 2017), compare-aggregate approaches (Wang and Jiang, 2017; Rücklé et al., 2019a), and, more recently, transformer-based models (Hashemi et al., 2020; Mass et al., 2019). Fine-tuning of large pre-trained transformers such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) currently achieves state-of-the-art performance on many related benchmarks (Garg et al., 2020; Mass et al., 2019; Rochette et al., 2019; Nogueira and Cho, 2019).…”
Section: Related Work (mentioning)
confidence: 99%
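
The fine-tuning recipe this excerpt describes amounts, in outline, to sentence-pair classification over a question and a candidate answer. Below is a minimal sketch of that idea using the Hugging Face transformers API; the bert-base-uncased checkpoint, the example pair, and the "label 1 = relevant" convention are illustrative assumptions, not the setup of any paper cited above.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; any BERT/RoBERTa-style encoder works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # assumed: 0 = non-relevant, 1 = relevant
)
model.eval()

question = "Why do cats purr when you pet them?"
candidate = "Purring is a self-soothing behavior that cats also use to signal contentment."

# Encode as one sequence: [CLS] question [SEP] candidate [SEP]
inputs = tokenizer(question, candidate, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# After fine-tuning on labeled question-answer pairs, candidate answers
# would be ranked by the probability of the "relevant" class.
score = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"relevance score: {score:.3f}")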
“…However, a considerable bottleneck in their development and evaluation is the lack of datasets covering both sub-tasks equally well. Current datasets either focus on retrieval results with dense judgements across ranked lists [7,11], or on question-answer selections in single candidate texts that can retroactively be converted to retrieval collections but lead to incomplete retrieval judgements [27].…”
(mentioning)
confidence: 99%
“…We employ four datasets and three retrieval tasks: MSDialog (Qu et al., 2018) and MANtIS (Penha et al., 2019) for conversation response ranking, Quora (Iyer et al., 2017) for similar question retrieval, and ANTIQUE (Hashemi et al., 2019) for non-factoid question answering. We use the official train, validation and test sets provided by the datasets' creators.…”
Section: Methods (mentioning)
confidence: 99%
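
Since the excerpt above relies on ANTIQUE's official splits, the following sketch shows one common way to load them, assuming the ir_datasets package and its antique/train identifier (the test split is analogous); this is a convenience loader, not the dataset authors' own tooling.

import itertools
import ir_datasets

# Assumed ir_datasets identifier for ANTIQUE's official training split.
train = ir_datasets.load("antique/train")

# Queries are the open-domain non-factoid questions.
for query in itertools.islice(train.queries_iter(), 3):
    print(query.query_id, query.text)

# Qrels hold the crowdsourced graded relevance judgments (a 1-4 scale in ANTIQUE).
for qrel in itertools.islice(train.qrels_iter(), 3):
    print(qrel.query_id, qrel.doc_id, qrel.relevance)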