Proceedings of the Workshop on Machine Reading for Question Answering 2018
DOI: 10.18653/v1/w18-2602

Systematic Error Analysis of the Stanford Question Answering Dataset

Abstract: We analyzed the outputs of multiple question answering (QA) models applied to the Stanford Question Answering Dataset (SQuAD) to identify the core challenges for QA systems on this data set. Through an iterative process, challenging aspects were hypothesized through qualitative analysis of the common error cases. A classifier was then constructed to predict whether SQuAD test examples were likely to be difficult for systems to answer based on features associated with the hypothesized aspects. The classifier's …
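The difficulty-prediction step described in the abstract can be illustrated with a minimal sketch. Assuming hand-crafted per-question features (e.g., question length, question/passage lexical overlap, answer span length) and binary labels marking whether the QA systems failed on an example, a simple logistic-regression classifier is one plausible instantiation; the features and data below are hypothetical, not the paper's actual setup.

```python
# Minimal sketch of a difficulty classifier over hypothetical question
# features; the features and data are illustrative, not the authors' setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [question length, question/passage lexical overlap, answer span length]
X_train = np.array([
    [12, 0.8, 2],
    [25, 0.3, 7],
    [9,  0.9, 1],
    [30, 0.2, 9],
])
# Label 1 = the QA systems answered this example incorrectly ("difficult"),
# label 0 = answered correctly ("easy")
y_train = np.array([0, 1, 0, 1])

clf = LogisticRegression().fit(X_train, y_train)

# Estimate how likely a new test example is to be difficult
x_new = np.array([[28, 0.25, 6]])
print("P(difficult):", clf.predict_proba(x_new)[0, 1])
```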

Cited by 26 publications (19 citation statements)
References 9 publications (9 reference statements)
“…Weissenborn, Wiese, and Seiffe (2017) found a feature indicating if a word appears in the question important, and suggested that questions can be answered with some rules that rely only on superficial features. Rondeau and Hazen (2018) validated the suggestion by a series of systematic experiments. However, the conversational question answering systems have been rarely explored.…”
Section: Related Work (supporting)
confidence: 59%
“…But suppose there's a new subject that wants to challenge Ken; they are not going to reliably dethrone Ken until their skill θ c is greater than six. This is a more mathematical formulation of the "easy" and "hard" dataset splits in question answering (Sugawara et al., 2018; Rondeau and Hazen, 2018; Sen and Saffari, 2020). In IRT-feas, this recapitulates the observation of Boyd-Graber and Börschinger (2020) that annotation error can hinder effective leaderboards.…”
Section: Examples Are Not Equally Useful (mentioning)
confidence: 74%
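The skill threshold in the quoted passage follows the standard item response theory (IRT) form. A minimal sketch of the one-parameter logistic (Rasch) response curve is below; the item difficulty of 6 is only the quote's illustrative number, not a value from the paper.

```python
# Sketch of a 1PL (Rasch) IRT response curve: the probability that a
# subject with skill theta answers an item of difficulty b correctly.
import math

def p_correct(theta: float, b: float) -> float:
    """1PL IRT: P(correct) = sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# With item difficulty b = 6 (the quote's example), a challenger only
# answers correctly more often than not once their skill exceeds 6.
for theta in (4.0, 6.0, 8.0):
    print(theta, round(p_correct(theta, b=6.0), 3))
```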
“…If you find them boring, repetitive, or uninteresting, so will crowdworkers. If you can find shortcuts to answer questions (Rondeau and Hazen, 2018; Kaushik and Lipton, 2018), so will a computer.…”
Section: Are We Having Fun? (mentioning)
confidence: 99%