Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017
DOI: 10.18653/v1/p17-1075

Evaluation Metrics for Machine Reading Comprehension: Prerequisite Skills and Readability

Abstract: Knowing the quality of reading comprehension (RC) datasets is important for the development of natural-language understanding systems. In this study, two classes of metrics were adopted for evaluating RC datasets: prerequisite skills and readability. We applied these classes to six existing datasets, including MCTest and SQuAD, and highlighted the characteristics of the datasets according to each metric and the correlation between the two classes. Our dataset analysis suggests that the readability of RC datasets…
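To make the readability class concrete, the sketch below computes the Flesch-Kincaid grade level, one widely used readability formula, using a naive vowel-group syllable heuristic. This is an illustrative assumption, not the authors' exact tooling or metric set.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable count: number of vowel groups, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

if __name__ == "__main__":
    # Hypothetical passage, only to show how a score would be reported per text.
    passage = ("The committee deliberated extensively before reaching a unanimous "
               "conclusion. Its recommendation was adopted without amendment.")
    print(f"Flesch-Kincaid grade level: {flesch_kincaid_grade(passage):.1f}")
```

A higher grade level indicates text that is harder to read; applying such a formula to every passage in a dataset gives one of the readability-style measurements the abstract refers to.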

Cited by 32 publications (36 citation statements) · References 25 publications
“…For example, Weston et al (2015) defined 20 skills as a set of toy tasks. Sugawara et al (2017) also organized 10 prerequisite skills for MRC. LoBue and Yates (2011) and Sammons et al (2010) analyzed entailment phenomena using detailed classifications in RTE.…”
Section: Related Work
confidence: 99%
“…Finally, we will describe the human analog of the models' strategy, followed by our conclusions. Sugawara et al (2017) evaluated various datasets, in particular SQuAD, to determine how many human reading skills were required to answer questions. They described SQuAD as "difficult to read but easy to answer" for humans, finding that SQuAD requires only a few simple skills.…”
Section: Text Organization
confidence: 99%
“…Reading the passage will prime the question creators towards questions based on interrogative paraphrases of the passage. As noted by Sugawara et al (2017), "SQuAD was difficult to read," which should further magnify this effect: when the passage is hard to read, it is easier and faster to scan it for a sentence stating a fact and to reformulate that sentence as a question. In particular, since crowdworkers are not motivated by a genuine need for information, we can expect them to use the first question that came to mind.…”
Section: Priming During Data Collection
confidence: 99%
“…1 Support for this is given in Sugawara et al (2017), who show that Who-did-what dataset, for example, requires on average a larger number of reading skills than SQuAD (Rajpurkar et al, 2016) and MCTest (Richardson et al, 2013).…”
Section: Related Datasets
confidence: 99%
“…In annotating the skills, we followed the categorization by Sugawara et al (2017). Bridging: inference through grammatical and lexical knowledge (synonymy, idioms etc).…”
Section: B List Of Skills With Selected Examples
confidence: 99%