Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2016
DOI: 10.18653/v1/p16-1204

Most "babies" are "little" and most "problems" are "huge": Compositional Entailment in Adjective-Nouns

Abstract: We examine adjective-noun (AN) composition in the task of recognizing textual entailment (RTE). We analyze behavior of ANs in large corpora and show that, despite conventional wisdom, adjectives do not always restrict the denotation of the nouns they modify. We use natural logic to characterize the variety of entailment relations that can result from AN composition. Predicting these relations depends on context and on commonsense knowledge, making AN composition especially challenging for current RTE systems.
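The abstract's claim is easiest to see with concrete adjective-noun pairs. Below is a minimal sketch (ours, not the paper's code or data) of the natural-logic view it describes: a restrictive adjective yields forward entailment, a prototypical one yields near-equivalence, and a privative one yields alternation. The adjective class lists and the default rule are illustrative assumptions; the paper's point is precisely that the correct relation depends on context and commonsense knowledge, not on a fixed lexicon.

```python
# A minimal sketch of natural-logic relations for adjective-noun (AN)
# composition. The adjective classes below are illustrative assumptions,
# not the paper's actual lexicon or method.

EQUIVALENCE = "≡"   # "little baby" ≈ "baby": most babies are little anyway
FORWARD     = "⊏"   # "red car" entails "car": ordinary restrictive reading
ALTERNATION = "|"   # "fake gun" and "gun" are mutually exclusive sets

# Hypothetical adjective classes for demonstration only.
PRIVATIVE       = {"fake", "counterfeit", "imaginary"}
NON_RESTRICTIVE = {"little", "huge"}    # prototypical of the noun they modify

def an_relation(adjective: str) -> str:
    """Guess the natural-logic relation between 'ADJ N' and bare 'N'."""
    if adjective in PRIVATIVE:
        return ALTERNATION
    if adjective in NON_RESTRICTIVE:
        return EQUIVALENCE
    return FORWARD  # default: the adjective restricts the noun's denotation

for adj, noun in [("little", "baby"), ("fake", "gun"), ("red", "car")]:
    print(f"'{adj} {noun}'  {an_relation(adj)}  '{noun}'")
```

A rule table like this is exactly what the paper argues is insufficient: the same adjective can behave restrictively in one context and non-restrictively in another.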

Cited by 34 publications (42 citation statements) · References 18 publications
“…As target datasets, we use the 10 datasets investigated by Poliak et al (2018b) in their hypothesis-only study, plus two test sets: GLUE's diagnostic test set, which was carefully constructed to not contain hypothesis-biases (Wang et al, 2018), and SNLI-hard, a subset of the SNLI test set that is thought to have fewer biases (Gururangan et al, 2018). The target datasets include human-judged datasets that used automatic methods to pair premises and hypotheses, and then relied on humans to label the pairs: SCITAIL, ADD-ONE-RTE (Pavlick & Callison-Burch, 2016), Johns Hopkins Ordinal Commonsense Inference, […] (Reisinger et al, 2015). As many of these datasets have different label spaces than SNLI, we define a mapping (Appendix A.1) from our models' predictions to each target dataset's labels.…”
Section: Methods
confidence: 99%
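The label-space mapping mentioned at the end of the quoted passage (its Appendix A.1) is, in essence, a projection from a 3-way SNLI prediction onto each target dataset's label set. Below is a hedged sketch of that idea with hypothetical mappings; the actual tables live in the cited paper's appendix, and the label names here are assumptions.

```python
# A minimal sketch, assuming hypothetical label sets, of mapping a 3-way
# SNLI-trained model's predictions onto a target dataset's label space.
# These dictionaries are illustrative, not the authors' Appendix A.1 tables.

SNLI_LABELS = ["entailment", "neutral", "contradiction"]

# e.g., a 2-way entails/neutral dataset such as SCITAIL:
# fold contradiction into neutral.
TO_SCITAIL = {
    "entailment": "entails",
    "neutral": "neutral",
    "contradiction": "neutral",
}

# e.g., a binary entailed/not-entailed dataset such as ADD-ONE-RTE.
TO_BINARY = {
    "entailment": "entailed",
    "neutral": "not-entailed",
    "contradiction": "not-entailed",
}

def map_prediction(label: str, mapping: dict) -> str:
    """Project a 3-way SNLI prediction onto a target dataset's labels."""
    return mapping[label]

assert map_prediction("contradiction", TO_SCITAIL) == "neutral"
assert map_prediction("entailment", TO_BINARY) == "entailed"
```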
“…Although NLI is generally cast in informal terms that embrace the indeterminacy of such reasoning, the task nonetheless manifests a number of very predictable reasoning patterns. For example, systematic manipulations of the lexical meanings (Glockner et al, 2018), syntactic constructions (Nie et al, 2019a), and contextual assumptions (Pavlick and Callison-Burch, 2016) have systematic effects on the correct labels. These patterns present crisp, motivated learning targets that we can leverage to not only evaluate the ability of NLI models to learn robust solutions, but also to analyze the internal dynamics of successful models.…”
Section: Introduction
confidence: 99%
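The "systematic manipulations" the quoted passage mentions can be made concrete with a tiny probe generator. The sketch below is illustrative, not code from any cited paper: swapping a word for a hypernym should move a premise/hypothesis pair toward entailment, while swapping in a co-hyponym should yield contradiction, in the spirit of the lexical manipulations attributed to Glockner et al (2018). The substitution tables are hypothetical.

```python
# A minimal sketch of generating NLI probes via systematic lexical
# substitution. The word tables are illustrative assumptions.

HYPERNYM   = {"dog": "animal", "violin": "instrument"}
CO_HYPONYM = {"dog": "cat",    "violin": "cello"}

def make_probes(sentence: str, word: str):
    """Generate (premise, hypothesis, expected_label) probe triples."""
    probes = []
    if word in HYPERNYM:    # broadening the term preserves truth
        probes.append((sentence, sentence.replace(word, HYPERNYM[word]),
                       "entailment"))
    if word in CO_HYPONYM:  # a sibling term makes the pair incompatible
        probes.append((sentence, sentence.replace(word, CO_HYPONYM[word]),
                       "contradiction"))
    return probes

for p, h, label in make_probes("A man is playing a violin .", "violin"):
    print(f"{label:13s} P: {p}  H: {h}")
```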
“…Past work has also evaluated commonsense capabilities in neural models. Pavlick and Callison-Burch (2016) investigate the related problem of entailment in adjective-nouns, and show surprising negative results for neural NLI models. Wang et al (2018) showed that models based on distributional semantics without explicit external knowledge perform poorly at predicting physical plausibility of actions.…”
Section: Related Work
confidence: 99%