Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1606
Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

Abstract: Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging…
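The abstract describes Quoref as a span-selection dataset: each question is paired with a Wikipedia paragraph and is answered by one or more spans copied from that paragraph. Below is a minimal sketch for inspecting that format, assuming the dataset is mirrored on the Hugging Face Hub under the ID "quoref" with SQuAD-style fields (context, question, answers); the dataset ID and field names are assumptions, not taken from this page.

```python
# Minimal sketch: inspect Quoref's span-selection format.
# Assumption: the dataset is available on the Hugging Face Hub as "quoref"
# with SQuAD-style fields; adjust the ID / field names if your copy differs.
from datasets import load_dataset

quoref = load_dataset("quoref", split="validation")

example = quoref[0]
print(example["question"])
print(example["answers"]["text"])   # gold answer span(s), copied verbatim from the paragraph
print(example["context"][:300])     # the Wikipedia paragraph the span is drawn from
```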

Cited by 111 publications (67 citation statements)
References 21 publications
“…Extractive QA (EX). Among the datasets in this popular format, we adopt SQuAD 1.1 (Rajpurkar et al., 2016), SQuAD 2 (Rajpurkar et al., 2018), NewsQA (Trischler et al., 2017), Quoref (Dasigi et al., 2019), ROPES (Lin et al., 2019).…”
Section: Datasets
Citation type: mentioning (confidence: 99%)
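The statement above groups Quoref with other extractive (span-selection) QA benchmarks. As a rough illustration of what that format means in practice, here is a hedged sketch using the Hugging Face question-answering pipeline; the checkpoint name and the toy passage are illustrative assumptions and are not drawn from the cited papers, and the reader shown is a SQuAD-style model rather than one trained on Quoref.

```python
# Illustrative sketch of extractive (span-selection) QA: the model must
# return a span copied verbatim from the passage. Checkpoint and passage
# are assumptions for demonstration only.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "Anna grew up in Tallinn before moving to Berlin, where she opened "
    "a small bookshop. She still visits her hometown every summer."
)
# Answering this requires linking "the bookshop owner" back to Anna/"her hometown".
question = "Which city does the bookshop owner visit every summer?"

prediction = qa(question=question, context=context)
print(prediction["answer"], prediction["score"])  # a span from the passage, e.g. "Tallinn"
```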
“…To make our observations and conclusions as general as possible, we experiment over a diverse range of QA datasets with broad domain coverage over questions regarding both factual and commonsense knowledge (Khashabi et al., 2020; Hendrycks et al., 2020; Rajpurkar et al., 2016, 2018; Trischler et al., 2017; Dasigi et al., 2019; Lin et al., 2019; Richardson et al., 2013; Lai et al., 2017; Mihaylov et al., 2018; Talmor et al., 2019b; Bisk et al., 2020; Sakaguchi et al., 2020). We list all the datasets we used in Table 2 and their corresponding domain.…”
Section: LM-based Question Answering
Citation type: mentioning (confidence: 99%)
“…ing comprehension tasks, such as SQuAD 2.0 (Rajpurkar et al., 2018), DROP (Dua et al., 2019b), or Quoref (Dasigi et al., 2019), evaluate models using a relatively simpler setup where all the information required to answer the questions (including judging them as being unanswerable) is provided in the associated contexts. While this setup has led to significant advances in reading comprehension (Ran et al., 2019; Zhang et al., 2020), the tasks are still limited since they do not evaluate the capability of models at identifying precisely what information, if any, is missing to answer a question, and where that information might be found.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)