Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1215

Adversarial Examples for Evaluating Reading Comprehension Systems

Abstract: Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems…
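To make the evaluation scheme concrete, the following is a minimal illustrative sketch (not the authors' released code) of comparing a QA system's accuracy on clean paragraphs against the same paragraphs with a distracting sentence appended; qa_model and make_distractor are placeholder callables assumed for illustration.

def adversarial_evaluation(qa_model, make_distractor, examples):
    # examples: list of (paragraph, question, gold_answer) triples.
    # qa_model(paragraph, question) -> predicted answer string (placeholder).
    # make_distractor(question, gold_answer) -> distracting sentence (placeholder).
    clean_correct = adversarial_correct = 0
    for paragraph, question, gold_answer in examples:
        if qa_model(paragraph, question) == gold_answer:
            clean_correct += 1
        # Append a distractor that does not change the correct answer;
        # a robust system should still answer the original question.
        perturbed = paragraph + " " + make_distractor(question, gold_answer)
        if qa_model(perturbed, question) == gold_answer:
            adversarial_correct += 1
    total = len(examples)
    return clean_correct / total, adversarial_correct / total

The gap between the two returned accuracies is the quantity the adversarial evaluation is designed to expose.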

Cited by 1,140 publications (1,191 citation statements); references 29 publications.
“…With the help of the large-scale datasets, the RC models evolve rapidly and even outperform humans on some tasks (Cui et al., 2017; Seo et al., 2017; Xiong et al., 2018; Radford, 2018; Hu et al., 2018). However, this does not imply that the machine has acquired real intelligence, as the machine can be fooled easily on artificial examples (Jia and Liang, 2017).…”
Section: Related Work (mentioning)
confidence: 99%
“…Different from the previously discussed idea of embedding level perturbations, Jia and Liang (2017) generated adversarial examples for MRC tasks at the word token level. They introduced the AddSent algorithm, which generates adversarial examples by appending distracting sentences to the input passages.…”
Section: Related Work (mentioning)
confidence: 99%
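As a rough illustration of the AddSent idea described above (not the published algorithm, which uses WordNet antonyms, nearby GloVe entities, rule-based statement generation, and crowdworker filtering), a distractor can be built by mutating the question's content words and attaching a fake answer of a compatible type; the antonyms dictionary and fake_answer string below are hypothetical inputs.

def mutate_question(question, antonyms):
    # Swap content words for antonyms or unrelated words so the resulting
    # sentence no longer answers the original question.
    words = question.rstrip("?").split()
    return " ".join(antonyms.get(w.lower(), w) for w in words)

def addsent_style_distractor(question, fake_answer, antonyms):
    # Combine the mutated question with a fake answer into a declarative-looking
    # sentence that shares many surface words with the question but is unrelated.
    return f"{mutate_question(question, antonyms)} {fake_answer}."

Because such a distractor overlaps heavily with the question's surface form, a model that relies on word matching can be drawn to the appended sentence rather than the true answer.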
“…For example, [prior work] showed that adversarial perturbation in candidate answers results in a significant drop in the performance of a few state-of-the-art science QA systems. Similarly, Jia and Liang (2017) show that adding an adversarially selected sentence to the instances in the SQuAD dataset drastically reduces the performance of many of the existing baselines. Chen et al. (2016) show that in the CNN/Daily Mail datasets, "the required reasoning and inference level…"…”
Section: Introduction (mentioning)
confidence: 94%
“…In computer vision, it is common to adversarially train on artificially noisy examples to create a more robust model (Goodfellow et al., 2015; Szegedy et al., 2014). However, in the case of question answering, Jia and Liang (2017) show that training on one perturbation does not result in generalization to similar perturbations, revealing a need for models with stronger generalization capabilities. Similarly, adversarial testing has shown that strong models for the SNLI dataset (Bowman et al., 2015a) have significant holes in their knowledge of lexical and compositional semantics (Glockner et al., 2018; Naik et al., 2018; Nie et al., 2018; Yanaka et al., 2019; Dasgupta et al., 2018).…”
Section: Related Work (mentioning)
confidence: 99%
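For reference, here is a minimal PyTorch-style sketch of the FGSM adversarial training mentioned in the statement above (Goodfellow et al., 2015); the model, loss_fn, and epsilon values are placeholders chosen for illustration, not a definitive implementation.

import torch

def fgsm_perturb(model, loss_fn, x, y, epsilon=0.03):
    # Fast Gradient Sign Method: take one step in the direction of the sign
    # of the input gradient of the loss, then clip to the valid pixel range.
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, loss_fn, optimizer, x, y, epsilon=0.03):
    # Train on a mixture of clean and adversarially perturbed inputs.
    x_adv = fgsm_perturb(model, loss_fn, x, y, epsilon)
    optimizer.zero_grad()
    loss = 0.5 * loss_fn(model(x), y) + 0.5 * loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

The citing work's point is that the analogous strategy for question answering, training on one kind of distractor, did not generalize to similar distractors in Jia and Liang's experiments.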
“…However, in the case of question answering, Jia and Liang (2017) show that training on one perturbation does not result in generalization to similar perturbations, revealing a need for models with stronger generalization capabilities. Similarly, adversarial testing has shown that strong models for the SNLI dataset (Bowman et al., 2015a) have significant holes in their knowledge of lexical and compositional semantics (Glockner et al., 2018; Naik et al., 2018; Nie et al., 2018; Yanaka et al., 2019; Dasgupta et al., 2018). In addition, a number of recent papers suggest that even top models exploit dataset artifacts to achieve good quantitative results (Poliak et al., 2018; Gururangan et al., 2018; Tsuchiya, 2018), which further emphasizes the need to go beyond naturalistic evaluations.…”
Section: Related Work (mentioning)
confidence: 99%