Language Models Use Monotonicity to Assess NPI Licensing

Jumelet, Jaap; Denić, Milica; Szymanik, Jakub; Hupkes, Dieuwke; Steinert‐Threlkeld, Shane

doi:10.18653/v1/2021.findings-acl.439

Cited by 13 publications (12 citation statements)

References 37 publications (38 reference statements)


“…On the utility of probing tasks. Many recent papers provide compelling evidence that BERT contains a surprising amount of syntax, semantics, and world knowledge (Giulianelli et al, 2018; Rogers et al, 2020; Lakretz et al, 2019; Jumelet et al, 2019, 2021). Many of these works involve diagnostic classifiers or parametric probes, i.e.…”

Section: Related Work (mentioning)

confidence: 99%

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Sinha, Jia, Hupkes et al. 2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Self Cite

106


A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks mostly due to their ability to model higher-order word cooccurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and we show that these models still achieve high accuracy after finetuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order. Our models also perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pretraining, and they underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
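The word-order shuffling described in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and per-sentence shuffling; the function name `shuffle_words` is hypothetical, and this is not the preprocessing code of the cited paper.

```python
import random
from collections import Counter


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)  # in-place shuffle using the supplied generator
    return " ".join(words)


original = "the cat sat on the mat"
shuffled = shuffle_words(original, random.Random(0))

# Shuffling destroys word order but leaves the bag of words untouched,
# so per-sentence word counts are identical before and after.
same_counts = Counter(original.split()) == Counter(shuffled.split())
```

The check on `same_counts` shows which part of the signal survives: a model trained on shuffled sentences still sees which words occur together in a sentence, only not in what order.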

“…Most of the existing methods inspect a pre-specified model component (e.g., individual BERT layers) in a top-down manner. A typical approach first takes aim at specific linguistic phenomena that would be captured by the target components, and then trains a probing classifier that predicts the chosen linguistic phenomena from the target components (Bau et al, 2018; Giulianelli et al, 2018; Dalvi et al, 2019; Lakretz et al, 2019; Kovaleva et al, 2019; Goldberg, 2019; Petroni et al, 2019; Hewitt and Manning, 2019; Jawahar et al, 2019; Durrani et al, 2020; Zhou and Srikumar, 2021; Cao et al, 2021; Jumelet et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

Exploratory Model Analysis Using Data-Driven Neuron Representations

Oba, Yoshinaga, Toyoda 2021

Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP


Probing classifiers have been extensively used to inspect whether a model component captures specific linguistic phenomena. This top-down approach is, however, costly when we have no probable hypothesis on the association between the target model component and phenomena. In this study, aiming to provide a flexible, exploratory analysis of a neural model at various levels ranging from individual neurons to the model as a whole, we present a bottom-up approach to inspect the target neural model by using neuron representations obtained from a massive corpus of text. We first feed a massive amount of text to the target model and collect sentences that strongly activate each neuron. We then abstract the collected sentences to obtain neuron representations that help us interpret the corresponding neurons; we augment the sentences with linguistic annotations (e.g., part-of-speech tags) and various metadata (e.g., topic and sentiment), and apply pattern mining and clustering techniques to the augmented sentences. We demonstrate the utility of our method by inspecting the pre-trained BERT. Our exploratory analysis reveals that i) specific phrases and domains of text are captured by individual neurons in BERT, ii) a group of neurons simultaneously capture the same linguistic phenomena, and iii) deeper-level layers capture more specific linguistic phenomena.


“…A range of tests for causal language models consider if a model can represent a particular linguistic phenomenon (i.e., subject-verb-agreement, filler gap dependencies, negative polarity items; Jumelet et al, 2021, 2019; Wilcox et al, 2018; Gulordava et al, 2018), by measuring whether that model assigns a higher probability to a grammatical sentence involving that phenomenon than to its minimally different ungrammatical counterpart. In such tests, the comparison of probabilities is often focused on the probability of a single token, for instance, the probability of the correct and incorrect verb-form in a long sentence (Linzen et al, 2016).…”

Section: Methods (mentioning)

confidence: 99%
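The minimal-pair comparison described in the statement above can be sketched as follows. This is an illustration under loud assumptions: a toy add-one-smoothed bigram model trained on four invented sentences stands in for the neural causal language models used in the cited work, and all names (`train_bigram`, `log_prob`, `prefers_grammatical`) are hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])  # every token that precedes another
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Sentence log-probability under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens[:-1], tokens[1:])
    )


def prefers_grammatical(good, bad, model):
    """True if the model scores the grammatical sentence strictly higher."""
    return log_prob(good, model) > log_prob(bad, model)


# Invented toy corpus: the NPI "ever" only appears after the licensor "nobody".
toy_corpus = ["nobody ever left", "nobody ever came", "somebody left", "somebody came"]
model = train_bigram(toy_corpus)
passed = prefers_grammatical("nobody ever left", "somebody ever left", model)
```

A bigram model only sees adjacent words, so this toy pair places the licensor directly before the NPI; testing long-distance licensing requires a model that conditions on the full left context.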

Sparse Interventions in Language Models with Differentiable Masking

Cao, Schmid, Hupkes et al. 2021

Preprint

Self Cite


There has been a lot of interest in understanding what information is captured by hidden representations of language models (LMs). Typically, interpretation methods i) do not guarantee that the model actually uses the encoded information, and ii) do not discover small subsets of neurons responsible for a considered phenomenon. Inspired by causal mediation analysis, we propose a method that discovers within a neural LM a small subset of neurons responsible for a particular linguistic phenomenon, i.e., subsets causing a change in the corresponding token emission probabilities. We use a differentiable relaxation to approximately search through the combinatorial space. An L0 regularization term ensures that the search converges to discrete and sparse solutions. We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs. We observe that it is fast and finds better solutions than the alternative (REINFORCE). Our experiments confirm that each of these phenomena is mediated through a small subset of neurons that do not play any other discernible role.
