Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018)
DOI: 10.18653/v1/n18-1023
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences

Abstract: We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. We solicit and verify questions and answers for this challenge through a 4-step crowdsourcing experiment. Our challenge dataset contains ∼6k questions for +800 paragraphs across 7 different domains (elementary school science, news, travel guides, fiction stories, etc.), bringing linguistic diversity to the texts and to the questions' wordings. On a subset of our datas…



Cited by 316 publications (319 citation statements). References 22 publications.
“…We evaluated ROCC coupled with the proposed QA approach on two QA datasets. We use the standard train/development/test partitions for each dataset, as well as the standard evaluation measures: accuracy for ARC, and F1_m (macro-F1 score), F1_a (micro-F1 score), and EM0 (exact match) for MultiRC (Khashabi et al., 2018a).…”
Section: Empirical Evaluation (mentioning)
Confidence: 99%
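
The measures named in this statement reflect MultiRC's multi-label setup, where each question has several answer options and more than one can be correct. Below is a minimal sketch of how such scores could be computed, assuming predictions are 0/1 labels per answer option; the function names (binary_f1, multirc_scores) are illustrative and this is not the official MultiRC evaluation script.

# Minimal sketch of the MultiRC-style measures named above, assuming each
# question is a pair (gold_labels, pred_labels) of 0/1 lists over its
# answer options. Illustrative only; not the official evaluation script.

def binary_f1(gold, pred):
    # F1 over parallel 0/1 label lists.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)

def multirc_scores(questions):
    # F1_m (macro): F1 computed per question, then averaged over questions.
    f1_m = sum(binary_f1(g, p) for g, p in questions) / len(questions)
    # F1_a (micro): F1 over all answer options pooled across questions.
    all_gold = [label for g, _ in questions for label in g]
    all_pred = [label for _, p in questions for label in p]
    f1_a = binary_f1(all_gold, all_pred)
    # EM0 (exact match): share of questions with every option labeled correctly.
    em0 = sum(1 for g, p in questions if g == p) / len(questions)
    return f1_m, f1_a, em0

# Example: two questions; the first scored perfectly, the second partially.
print(multirc_scores([([1, 0, 1], [1, 0, 1]), ([1, 1, 0], [1, 0, 0])]))

The macro/micro distinction matters here because questions have varying numbers of answer options: F1_m weights every question equally, while F1_a weights every answer option equally.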
“…There exist several other multi-hop reasoning datasets, including WorldTree, OpenBookQA (Mihaylov et al., 2018), and MultiRC (Khashabi et al., 2018). These datasets are more complex to analyze since the answers may not appear directly in the passage and may simply be entailed by passage content.…”
Section: Discussion (mentioning)
Confidence: 99%
“…The difference between QAngaroo and our focus is two-fold: (1) QAngaroo does not have supervised evidence, and (2) the questions in QAngaroo are inherently limited because the dataset is constructed using a knowledge base. MultiRC (Khashabi et al., 2018) is also an explainable multi-hop QA dataset that provides gold evidence sentences. However, it is difficult to compare the performance of evidence extraction with other studies because its evaluation script and leaderboard do not report the evidence extraction score.…”
Section: Reading Comprehension (mentioning)
Confidence: 99%