Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1019
Diversify Your Datasets: Analyzing Generalization via Controlled Variance in Adversarial Datasets

Abstract: Phenomenon-specific "adversarial" datasets have recently been designed to perform targeted stress-tests for particular inference types. Recent work (Liu et al., 2019a) proposed that such datasets can be used to train NLI and other types of models, often allowing them to learn the phenomenon in focus and improve on the challenge dataset, indicating a "blind spot" in the original training data. Yet, although a model can improve through such training, it might still be vulnerable to other challenge datase…

Cited by 33 publications (31 citation statements)
References 18 publications
“…A related line of work has been analyzing the mathematical reasoning abilities of neural models over text (Wallace et al., 2019; Rozen et al., 2019; Ravichander et al., 2019), and on arithmetic problems (Saxton et al., 2019; Amini et al., 2019; Lample and Charton, 2020).…”
Section: Related Work
confidence: 99%
“…Regarding assessment of the behavior of modern language models, Linzen et al. (2016) and Goldberg (2019) investigated their syntactic capabilities by testing such models on subject-verb agreement tasks. Many studies of NLI tasks (Liu et al., 2019; Glockner et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; McCoy et al., 2019; Rozen et al., 2019; Ross and Pavlick, 2019) have provided evaluation methodologies and found that current NLI models often fail on particular inference types, or that they learn undesired heuristics from the training set. In particular, recent works (Yanaka et al., 2019a,b; Richardson et al., 2020) have evaluated models on monotonicity, but did not focus on the ability to generalize to unseen combinations of patterns.…”
Section: Related Work
confidence: 99%
“…To mitigate these problems, Liu et al. (2019a) introduced a systematic, task-agnostic method for analyzing datasets. Rozen et al. (2019) further explain how to improve challenge datasets and why diversity matters. Geva et al. (2019) suggest that training and test data should come from disjoint sets of annotators to avoid annotator bias.…”
Section: Related Work
confidence: 99%