Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.230

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Abstract: A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks mostly due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and we show that these models still achieve high accuracy…
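The perturbation described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' preprocessing code; it is a minimal sketch, assuming whitespace tokenization, of shuffling non-overlapping n-grams within a sentence before the usual MLM masking step (n=1 corresponds to fully randomized word order). The function name shuffle_ngrams and its seed parameter are illustrative, not from the paper.

```python
import random
from typing import Optional

def shuffle_ngrams(sentence: str, n: int = 1, seed: Optional[int] = None) -> str:
    """Randomly permute the non-overlapping n-grams of a whitespace-tokenized
    sentence. n=1 destroys word order entirely; larger n keeps local order
    inside each chunk while scrambling global order."""
    rng = random.Random(seed)
    tokens = sentence.split()
    # Split the token list into consecutive chunks of n tokens
    # (the final chunk may be shorter).
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng.shuffle(chunks)
    return " ".join(token for chunk in chunks for token in chunk)

# Example: corrupt a pre-training sentence before applying MLM masking.
# Prints one deterministic permutation of the bigram chunks.
print(shuffle_ngrams("the quick brown fox jumps over the lazy dog", n=2, seed=0))
```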

Cited by 106 publications (98 citation statements)
References 62 publications

“…In contrast, several works report that models fine-tuned on such perturbed data still produce high-confidence predictions and perform close to their counterparts on many tasks, including the GLUE benchmark (Ahmad et al., 2019; Sinha et al., 2020; Liu et al., 2021; Hessel and Schofield, 2021; Gupta et al., 2021). Similar results are demonstrated by the RoBERTa model (Liu et al., 2019b) when the word order perturbations are incorporated into the pre-training objective (Panda et al., 2021) or tested as a part of full pre-training on the perturbed corpora (Sinha et al., 2021). Sinha et al. (2021) find that the randomized RoBERTa models are similar to their naturally pre-trained peer according to parametric probes but perform worse according to the non-parametric ones.…”
Section: Related Work (supporting)
confidence: 64%
“…Some studies show that shuffling word order causes significant performance drops on a wide range of QA tasks (Sugawara et al., 2020). However, a number of works demonstrate that such permutation has little to no impact during the pre-training and fine-tuning stages (Pham et al., 2020; Sinha et al., 2020, 2021; O'Connor and Andreas, 2021; Hessel and Schofield, 2021; Gupta et al., 2021). The latter contradicts the common understanding of how hierarchical and structural information is encoded in LMs (Rogers et al., 2020), and may even call into question whether word order is modeled by the position embeddings (Dufter et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…The Transformer architecture (Vaswani et al., 2017) became the backbone of state-of-the-art models in a variety of tasks (Raffel et al., 2019; Adiwardana et al., 2020; Brown et al., 2020). This spurred significant interest in better understanding the inner workings of these models (Vig and Belinkov, 2019; Clark et al., 2019; Kharitonov and Chaabouni, 2020; Hahn, 2020; Movva and Zhao, 2020; Chaabouni et al., 2021; Merrill et al., 2021; Sinha et al., 2021). Most of these works have focused specifically on how models generalize and capture structure across samples that are similar.…”
Section: Introduction (mentioning)
confidence: 99%
“…They show that models are insensitive to word reorderings, some of which can actually result in improved task performance. Perhaps most strikingly, Sinha et al. (2021) show that pre-training full-scale RoBERTa models on perturbed sentences (across n-grams of varying lengths) and fine-tuning them on unaltered GLUE tasks leads to negligible performance loss. They also report that a popular probe for dependency structure, that of Pimentel et al. (2020), is able to decode trees from the perturbed representations with considerable accuracy, even from a unigram baseline with resampled words.…”
Section: NLU Evaluation (mentioning)
confidence: 99%