Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.712

Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning

Abstract: Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this. First, we formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts. This allows developing a model-agnostic probe for me…
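
The probe mentioned in the abstract is easy to illustrate. Below is a minimal Python sketch, assuming a generic `qa_model` callable (a hypothetical stand-in for any QA system): it flags a question as answerable via disconnected reasoning when the model recovers the gold answer from a single supporting fact in isolation. The actual probe in Trivedi et al. (2020) is more general, partitioning the supporting facts into subsets and scoring answer and support predictions on each part.

```python
# Minimal sketch of a disconnected-reasoning probe, assuming a
# hypothetical qa_model(question, context) -> answer callable.
from typing import Callable, List

def dire_probe(question: str,
               supporting_facts: List[str],
               gold_answer: str,
               qa_model: Callable[[str, str], str]) -> bool:
    """Return True if the model answers correctly from some single
    supporting fact, i.e. without connecting information across facts."""
    for fact in supporting_facts:
        # Show the model only one supporting fact at a time.
        prediction = qa_model(question, fact)
        if prediction.strip().lower() == gold_answer.strip().lower():
            return True  # answered without the other fact(s)
    return False
```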

Cited by 15 publications (37 citation statements) · References 23 publications
“…Our focus is to identify and alleviate reasoning shortcuts in multi-hop QA, without evidence annotations. Models taking shortcuts have been widely observed in various tasks, such as object detection (Singh et al., 2020) and NLI (Tu et al., 2020), as well as in our target task of multi-hop QA (Min et al., 2019; Chen and Durrett, 2019; Trivedi et al., 2020), where models learn simple heuristic rules and answer correctly but without proper reasoning.…”
Section: Related Work (mentioning)
confidence: 99%
“…To mitigate the effect of shortcuts, adversarial examples (Jiang and Bansal, 2019) can be generated; alternatively, models can be robustified (Trivedi et al., 2020) with additional supervision for paragraph-level "sufficiency" — identifying whether a pair of paragraphs is sufficient for correct reasoning — which reduces shortcuts within a single paragraph. While binary classification of paragraph sufficiency is relatively easy (96.7 F1 in Trivedi et al. (2020)), our target of capturing finer-grained sentence evidentiality is more challenging. Existing QA models (Nie et al., 2019; Groeneveld et al., 2020) treat this as a supervised task, based on sentence-level human annotation.…”
Section: Related Work (mentioning)
confidence: 99%
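
The paragraph-level "sufficiency" supervision described in the statement above can be sketched as a data-construction step: positive examples pair both gold paragraphs, and negatives replace one gold paragraph with a distractor, so the classifier must notice when half of the required evidence is missing. The names `SufficiencyExample` and `make_examples` below are illustrative, not taken from either paper, and the exact negative-sampling scheme may differ.

```python
# Rough sketch of building paragraph-sufficiency training pairs.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class SufficiencyExample:
    question: str
    paragraphs: List[str]  # a pair of paragraphs
    label: int             # 1 = sufficient, 0 = insufficient

def make_examples(question: str,
                  gold: List[str],
                  distractors: List[str]) -> List[SufficiencyExample]:
    positive = SufficiencyExample(question, gold[:2], 1)
    # Dropping either gold paragraph leaves insufficient evidence.
    negatives = [
        SufficiencyExample(question, [g, random.choice(distractors)], 0)
        for g in gold[:2]
    ]
    return [positive] + negatives
```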
“…While many multi-hop QA models exist for HotpotQA and DROP, these are often equally complex models (Tu et al., 2020; Fang et al., 2020; Ran et al., 2019) focusing on just one of these datasets. Only on HotpotQA, where supporting sentences are annotated, can these models also produce post-hoc explanations, but such explanations are often not faithful and have been shown to be gameable (Trivedi et al., 2020). TMNs are able to produce explanations for multiple datasets without needing such annotations, making them more generalizable to future datasets.…”
Section: Related Work (mentioning)
confidence: 99%
“…On the HotpotQA dataset, MODULARQA is comparable to S-NMN but underperforms compared to DecompRC. Note that DecompRC can choose to answer some questions using single-hop reasoning and potentially exploit many artifacts in this dataset (Min et al., 2019a; Trivedi et al., 2020).…”
Section: Comparison To Dataset-Specific Models (mentioning)
confidence: 99%
“…One of the crucial issues in evaluating multi-hop inference models is the possibility of achieving strong overall performance without genuinely compositional methods (Min et al., 2019; Chen and Durrett, 2019; Trivedi et al., 2020). Therefore, to evaluate multi-hop inference more explicitly, we break down the performance of each model with respect to the difficulty of accessing specific facts in an explanation via direct lexical overlap.…”
Section: Performance By Lexical Overlap (mentioning)
confidence: 99%
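
The overlap-based breakdown in the statement above can be made concrete with a short sketch: score each fact in an explanation by its lexical overlap with the question, then bucket questions by the hardest-to-reach fact. Whitespace tokenization and the bucket thresholds below are assumptions for illustration; the cited work's exact overlap measure may differ.

```python
# Sketch of bucketing questions by question-fact lexical overlap.
from typing import List

def lexical_overlap(question: str, fact: str) -> float:
    q, f = set(question.lower().split()), set(fact.lower().split())
    return len(q & f) / max(1, len(f))

def overlap_bucket(question: str, facts: List[str]) -> str:
    # The hardest fact to retrieve shares the fewest tokens with
    # the question; thresholds are illustrative.
    lowest = min(lexical_overlap(question, f) for f in facts)
    if lowest >= 0.5:
        return "high-overlap"
    if lowest >= 0.2:
        return "medium-overlap"
    return "low-overlap"
```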