How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
Preprint, 2018
DOI: 10.48550/arxiv.1808.04926

Cited by 22 publications (25 citation statements: 3 supporting, 22 mentioning, 0 contrasting)
References 0 publications
“…Even in the latter case, blindfold baselines perform surprisingly close to existing state-of-the-art methods. We note that this finding is reminiscent of several recent works in both Computer Vision and Natural Language Processing, where researchers have found that statistical irregularities in the dataset can enable degenerate methods to perform surprisingly well [11,12,14,21].…”
Section: Introduction (supporting, confidence: 75%)
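The "blindfold" baselines in the quote above are models trained without ever seeing the primary input they are supposed to reason over. As a toy illustration only, here is a minimal, hypothetical sketch of such a baseline using scikit-learn; the questions, answers, and choice of TF-IDF plus logistic regression are assumptions for illustration, not the setup of the citing paper.

```python
# Toy sketch of a "blindfold" baseline: the model is trained on questions
# and answer labels only; the passage/video it should "read" is never
# provided. Any above-chance accuracy therefore reflects dataset biases.
# Data and model choices below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_questions = [
    "What color is the sky in the clip?",
    "How many people enter the room?",
    "What color is the car?",
    "How many dogs are shown?",
]
train_answers = ["blue", "two", "red", "two"]  # gold answers as class labels

blindfold = make_pipeline(TfidfVectorizer(), LogisticRegression())
blindfold.fit(train_questions, train_answers)  # note: no passage/video input

# Question phrasing alone (e.g., "How many ...") can correlate with
# answer statistics, which the blindfold model can exploit.
print(blindfold.predict(["How many birds are there?"]))
```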
“…Similar observations were found on the Natural Language Inference (NLI) datasets, where methods ignoring the context and relying only on the hypothesis perform remarkably well [11,21]. Most recently, question-only and passage-only baselines on several QA datasets highlighted similar issues [14].…”
Section: Related Work (supporting, confidence: 61%)
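The hypothesis-only NLI baselines described in the quote above classify premise-hypothesis pairs while discarding the premise entirely. A minimal, hypothetical sketch follows; the examples and model choice are assumptions for illustration.

```python
# Hypothetical sketch of a hypothesis-only NLI baseline: the premise is
# discarded, so above-chance accuracy can only come from annotation
# artifacts in the hypotheses (e.g., negation words correlating with
# the contradiction label). Data and model choices are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

hypotheses = [
    "A person is outdoors.",
    "Nobody is outside.",
    "A man is eating food.",
    "No one is eating.",
]
labels = ["entailment", "contradiction", "entailment", "contradiction"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(hypotheses, labels)  # the premise side of each pair is never seen

print(clf.predict(["Nobody is eating outdoors."]))
```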
“…The majority of correspondence with human judgments can still be attributed to word overlap effects (disappearing once overlap is controlled), and improvements on the controlled settings are absent, very small, or highly localized to particular models, layers and representations. This outcome aligns with the increasing body of evidence that NLP datasets often do not require of models the level of linguistic sophistication that we might hope for; in particular, our identification of a strong spurious cue in the PAWS-QQP dataset adds to the growing number of findings emphasizing that NLP datasets often have artifacts that can inflate performance (Poliak et al., 2018; Gururangan et al., 2018; Kaushik and Lipton, 2018).…”
Section: Discussion (supporting, confidence: 82%)
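The "strong spurious cue" flagged for PAWS-QQP in the quote above is lexical overlap. A minimal sketch of such an overlap-only heuristic (the threshold and whitespace tokenization are assumptions) shows why the cue is spurious: it fires on word-scrambled pairs whose meanings differ, which is exactly the kind of pair PAWS was built from.

```python
# Minimal sketch of a word-overlap heuristic for paraphrase detection.
# The threshold and whitespace tokenization are illustrative assumptions.
def jaccard_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def predict_paraphrase(s1: str, s2: str, threshold: float = 0.5) -> bool:
    # Label "paraphrase" purely from lexical overlap; meaning is ignored.
    return jaccard_overlap(s1, s2) >= threshold

# The heuristic fires here even though swapping the cities reverses the
# meaning of the sentence.
print(predict_paraphrase("flights from new york to florida",
                         "flights from florida to new york"))  # True
```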
“…It has been shown that NLI systems can often be broken merely by performing simple lexical substitutions (Glockner et al, 2018), and that they struggle with quantifiers (Geiger et al, 2018) and certain superficial syntactic properties (McCoy et al, 2019). In reading comprehension and question answering, Kaushik and Lipton (2018) showed that question-and passage-only models can perform surprisingly well, while Jia and Liang (2017) added adversarially constructed sentences to passages, leading to a drastic drop in performance. Many text classification datasets do not require sophisticated linguistic reasoning, as shown by the surprisingly good performance of random encoders (Wieting and Kiela, 2019).…”
Section: Related Work (mentioning, confidence: 99%)
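The random-encoder result cited above (Wieting and Kiela, 2019) keeps the sentence encoder randomly initialized and frozen, training only a classifier on top. A hypothetical minimal sketch with frozen random word embeddings and mean pooling follows; the data, embedding dimension, and classifier are assumptions for illustration, not the paper's exact architectures.

```python
# Hypothetical sketch in the spirit of a random-encoder baseline: word
# embeddings are random and frozen; only the linear classifier is trained.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = {}  # each word gets a random vector on first sight, never updated

def embed(word: str, dim: int = 64) -> np.ndarray:
    if word not in vocab:
        vocab[word] = rng.standard_normal(dim)
    return vocab[word]

def encode(sentence: str) -> np.ndarray:
    # Mean pooling over frozen random embeddings, as one simple variant.
    return np.mean([embed(w) for w in sentence.lower().split()], axis=0)

texts = ["great movie", "terrible plot", "wonderful acting", "awful film"]
labels = [1, 0, 1, 0]
X = np.stack([encode(t) for t in texts])
clf = LogisticRegression().fit(X, labels)  # only this classifier is learned

print(clf.predict([encode("great acting")]))
```

If a task can be solved this way, it plausibly rewards surface word identity rather than sophisticated linguistic reasoning, which is the point the quoted passage makes.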