How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
Preprint, 2018
DOI: 10.48550/arxiv.1808.04926

Cited by 22 publications (25 citation statements: 3 supporting, 22 mentioning, 0 contrasting)
References 0 publications
“…Even in the latter case, blindfold baselines perform surprisingly close to existing state-of-the-art methods. We note that this finding is reminiscent of several recent works in both Computer Vision and Natural Language Processing, where researchers have found that statistical irregularities in the dataset can enable degenerate methods to perform surprisingly well [11,12,14,21].…”
Section: Introduction (supporting, confidence: 75%)
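The "blindfold" baselines in the quote above are models trained without ever seeing the primary input they are supposed to reason over. As a toy illustration only, here is a minimal, hypothetical sketch of such a baseline using scikit-learn; the questions, answers, and choice of TF-IDF plus logistic regression are assumptions for illustration, not the setup of the citing paper.

```python
# Toy sketch of a "blindfold" baseline: the model is trained on questions
# and answer labels only; the passage/video it should "read" is never
# provided. Any above-chance accuracy therefore reflects dataset biases.
# Data and model choices below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_questions = [
    "What color is the sky in the clip?",
    "How many people enter the room?",
    "What color is the car?",
    "How many dogs are shown?",
]
train_answers = ["blue", "two", "red", "two"]  # gold answers as class labels

blindfold = make_pipeline(TfidfVectorizer(), LogisticRegression())
blindfold.fit(train_questions, train_answers)  # note: no passage/video input

# Question phrasing alone (e.g., "How many ...") can correlate with
# answer statistics, which the blindfold model can exploit.
print(blindfold.predict(["How many birds are there?"]))
```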
“…Similar observations were found on the Natural Language Inference (NLI) datasets, where methods ignoring the context and relying only on the hypothesis perform remarkably well [11,21]. Most recently, question-only and passage-only baselines on several QA datasets highlighted similar issues [14].…”
Section: Related Work (supporting, confidence: 61%)
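The hypothesis-only NLI baselines described in the quote above classify premise-hypothesis pairs while discarding the premise entirely. A minimal, hypothetical sketch follows; the examples and model choice are assumptions for illustration.

```python
# Hypothetical sketch of a hypothesis-only NLI baseline: the premise is
# discarded, so above-chance accuracy can only come from annotation
# artifacts in the hypotheses (e.g., negation words correlating with
# the contradiction label). Data and model choices are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

hypotheses = [
    "A person is outdoors.",
    "Nobody is outside.",
    "A man is eating food.",
    "No one is eating.",
]
labels = ["entailment", "contradiction", "entailment", "contradiction"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(hypotheses, labels)  # the premise side of each pair is never seen

print(clf.predict(["Nobody is eating outdoors."]))
```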
“…The majority of correspondence with human judgments can still be attributed to word overlap effects (disappearing once overlap is controlled), and improvements on the controlled settings are absent, very small, or highly localized to particular models, layers and representations. This outcome aligns with the increasing body of evidence that NLP datasets often do not require of models the level of linguistic sophistication that we might hope for; in particular, our identification of a strong spurious cue in the PAWS-QQP dataset adds to the growing number of findings emphasizing that NLP datasets often have artifacts that can inflate performance (Poliak et al., 2018; Gururangan et al., 2018; Kaushik and Lipton, 2018).…”
Section: Discussion (supporting, confidence: 82%)
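The "strong spurious cue" flagged for PAWS-QQP in the quote above is lexical overlap. A minimal sketch of such an overlap-only heuristic (the threshold and whitespace tokenization are assumptions) shows why the cue is spurious: it fires on word-scrambled pairs whose meanings differ, which is exactly the kind of pair PAWS was built from.

```python
# Minimal sketch of a word-overlap heuristic for paraphrase detection.
# The threshold and whitespace tokenization are illustrative assumptions.
def jaccard_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def predict_paraphrase(s1: str, s2: str, threshold: float = 0.5) -> bool:
    # Label "paraphrase" purely from lexical overlap; meaning is ignored.
    return jaccard_overlap(s1, s2) >= threshold

# The heuristic fires here even though swapping the cities reverses the
# meaning of the sentence.
print(predict_paraphrase("flights from new york to florida",
                         "flights from florida to new york"))  # True
```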
“…It has been shown that NLI systems can often be broken merely by performing simple lexical substitutions (Glockner et al, 2018), and that they struggle with quantifiers (Geiger et al, 2018) and certain superficial syntactic properties (McCoy et al, 2019). In reading comprehension and question answering, Kaushik and Lipton (2018) showed that question-and passage-only models can perform surprisingly well, while Jia and Liang (2017) added adversarially constructed sentences to passages, leading to a drastic drop in performance. Many text classification datasets do not require sophisticated linguistic reasoning, as shown by the surprisingly good performance of random encoders (Wieting and Kiela, 2019).…”
Section: Related Work (mentioning, confidence: 99%)
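The random-encoder result cited above (Wieting and Kiela, 2019) keeps the sentence encoder randomly initialized and frozen, training only a classifier on top. A hypothetical minimal sketch with frozen random word embeddings and mean pooling follows; the data, embedding dimension, and classifier are assumptions for illustration, not the paper's exact architectures.

```python
# Hypothetical sketch in the spirit of a random-encoder baseline: word
# embeddings are random and frozen; only the linear classifier is trained.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = {}  # each word gets a random vector on first sight, never updated

def embed(word: str, dim: int = 64) -> np.ndarray:
    if word not in vocab:
        vocab[word] = rng.standard_normal(dim)
    return vocab[word]

def encode(sentence: str) -> np.ndarray:
    # Mean pooling over frozen random embeddings, as one simple variant.
    return np.mean([embed(w) for w in sentence.lower().split()], axis=0)

texts = ["great movie", "terrible plot", "wonderful acting", "awful film"]
labels = [1, 0, 1, 0]
X = np.stack([encode(t) for t in texts])
clf = LogisticRegression().fit(X, labels)  # only this classifier is learned

print(clf.predict([encode("great acting")]))
```

If a task can be solved this way, it plausibly rewards surface word identity rather than sophisticated linguistic reasoning, which is the point the quoted passage makes.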