Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.712

Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning

Abstract: Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this. First, we formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts. This allows developing a model-agnostic probe for me…
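
The probe mentioned in the abstract is easy to illustrate. Below is a minimal Python sketch, assuming a generic `qa_model` callable (a hypothetical stand-in for any QA system): it flags a question as answerable via disconnected reasoning when the model recovers the gold answer from a single supporting fact in isolation. The actual probe in Trivedi et al. (2020) is more general, partitioning the supporting facts into subsets and scoring answer and support predictions on each part.

```python
# Minimal sketch of a disconnected-reasoning probe, assuming a
# hypothetical qa_model(question, context) -> answer callable.
from typing import Callable, List

def dire_probe(question: str,
               supporting_facts: List[str],
               gold_answer: str,
               qa_model: Callable[[str, str], str]) -> bool:
    """Return True if the model answers correctly from some single
    supporting fact, i.e. without connecting information across facts."""
    for fact in supporting_facts:
        # Show the model only one supporting fact at a time.
        prediction = qa_model(question, fact)
        if prediction.strip().lower() == gold_answer.strip().lower():
            return True  # answered without the other fact(s)
    return False
```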

Cited by 15 publications (37 citation statements) · References 23 publications
“…Our focus is to identify and alleviate reasoning shortcuts in multi-hop QA, without evidence annotations. Models taking shortcuts have been widely observed in various tasks, such as object detection (Singh et al., 2020) and NLI (Tu et al., 2020), as well as in our target task of multi-hop QA (Min et al., 2019; Chen and Durrett, 2019; Trivedi et al., 2020), where models learn simple heuristic rules and answer correctly but without proper reasoning.…”
Section: Related Work (mentioning)
confidence: 99%
“…To mitigate the effect of shortcuts, adversarial examples (Jiang and Bansal, 2019) can be generated; alternatively, models can be robustified (Trivedi et al., 2020) with additional supervision for paragraph-level "sufficiency" — identifying whether a pair of paragraphs is sufficient for correct reasoning — which reduces shortcuts within a single paragraph. While binary classification of paragraph sufficiency is relatively easy (96.7 F1 in Trivedi et al. (2020)), our target of capturing finer-grained sentence evidentiality is more challenging. Existing QA models (Nie et al., 2019; Groeneveld et al., 2020) treat this as a supervised task, based on sentence-level human annotation.…”
Section: Related Work (mentioning)
confidence: 99%
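
The paragraph-level "sufficiency" supervision described in the statement above can be sketched as a data-construction step: positive examples pair both gold paragraphs, and negatives replace one gold paragraph with a distractor, so the classifier must notice when half of the required evidence is missing. The names `SufficiencyExample` and `make_examples` below are illustrative, not taken from either paper, and the exact negative-sampling scheme may differ.

```python
# Rough sketch of building paragraph-sufficiency training pairs.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class SufficiencyExample:
    question: str
    paragraphs: List[str]  # a pair of paragraphs
    label: int             # 1 = sufficient, 0 = insufficient

def make_examples(question: str,
                  gold: List[str],
                  distractors: List[str]) -> List[SufficiencyExample]:
    positive = SufficiencyExample(question, gold[:2], 1)
    # Dropping either gold paragraph leaves insufficient evidence.
    negatives = [
        SufficiencyExample(question, [g, random.choice(distractors)], 0)
        for g in gold[:2]
    ]
    return [positive] + negatives
```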
“…While many multi-hop QA models exist for HotpotQA and DROP, these are often equally complex models (Tu et al., 2020; Fang et al., 2020; Ran et al., 2019) focusing on just one of these datasets. Only on HotpotQA, where supporting sentences are annotated, can these models also produce post-hoc explanations, but such explanations are often not faithful and have been shown to be gameable (Trivedi et al., 2020). TMNs are able to produce explanations for multiple datasets without needing such annotations, making them more generalizable to future datasets.…”
Section: Related Work (mentioning)
confidence: 99%
“…On the HotpotQA dataset, MODULARQA is comparable to S-NMN but underperforms compared to DecompRC. Note that DecompRC can choose to answer some questions using single-hop reasoning and potentially exploit many artifacts in this dataset (Min et al., 2019a; Trivedi et al., 2020).…”
Section: Comparison To Dataset-Specific Models (mentioning)
confidence: 99%
“…One of the crucial issues in evaluating multi-hop inference models is the possibility of achieving strong overall performance without genuinely compositional methods (Min et al., 2019; Chen and Durrett, 2019; Trivedi et al., 2020). Therefore, to evaluate multi-hop inference more explicitly, we break down the performance of each model with respect to the difficulty of accessing specific facts in an explanation via direct lexical overlap.…”
Section: Performance By Lexical Overlap (mentioning)
confidence: 99%
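
The overlap-based breakdown in the statement above can be made concrete with a short sketch: score each fact in an explanation by its lexical overlap with the question, then bucket questions by the hardest-to-reach fact. Whitespace tokenization and the bucket thresholds below are assumptions for illustration; the cited work's exact overlap measure may differ.

```python
# Sketch of bucketing questions by question-fact lexical overlap.
from typing import List

def lexical_overlap(question: str, fact: str) -> float:
    q, f = set(question.lower().split()), set(fact.lower().split())
    return len(q & f) / max(1, len(f))

def overlap_bucket(question: str, facts: List[str]) -> str:
    # The hardest fact to retrieve shares the fewest tokens with
    # the question; thresholds are illustrative.
    lowest = min(lexical_overlap(question, f) for f in facts)
    if lowest >= 0.5:
        return "high-overlap"
    if lowest >= 0.2:
        return "medium-overlap"
    return "low-overlap"
```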