Many recent studies have treated neural LMs and contextualized word prediction models, primarily LSTM LMs (Sundermeyer et al., 2012), GPT-2 (Radford et al., 2019), and BERT (Devlin et al., 2019), as psycholinguistic subjects to be studied behaviorally (Linzen et al., 2016; Gulordava et al., 2018; Goldberg, 2019). Some have studied whether models prefer grammatical completions in subject-verb agreement contexts (Marvin and Linzen, 2018; van Schijndel et al., 2019; Goldberg, 2019; Mueller et al., 2020; Lakretz et al., 2021), as well as in filler-gap dependencies (Wilcox et al., 2018). These studies build on the approach of Linzen et al. (2016), in which a model's ability to generalize syntactically is measured by whether it chooses the correct inflection in difficult structural contexts instantiated by tokens the model has not seen together during training.
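To make this evaluation paradigm concrete, the sketch below scores a minimal pair of agreement completions with a pretrained LM, in the spirit of the approach just described. It assumes the Hugging Face transformers library and the public gpt2 checkpoint; the example sentences and the sentence_log_prob helper are illustrative assumptions, not materials from any of the cited studies.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability of a sentence under the LM (higher = preferred)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids returns the mean cross-entropy over predicted tokens.
        loss = model(ids, labels=ids).loss
    # Undo the averaging; the causal shift drops one prediction position.
    return -loss.item() * (ids.size(1) - 1)

# Hypothetical minimal pair: an "attractor" noun ("cabinet") intervenes
# between the plural subject ("keys") and the verb.
grammatical   = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

lp_good = sentence_log_prob(grammatical)
lp_bad  = sentence_log_prob(ungrammatical)
print("model prefers grammatical completion:", lp_good > lp_bad)

Comparing summed log-probabilities over whole-sentence minimal pairs is one common recipe; Linzen et al. (2016) instead compared the probabilities assigned to just the two candidate verb forms at the verb position, which isolates the agreement decision from the rest of the sentence.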