Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.230

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Abstract: A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks mostly due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and we show that these models still achieve high accuracy…
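The perturbation described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' preprocessing code; it is a minimal sketch, assuming whitespace tokenization, of shuffling non-overlapping n-grams within a sentence before the usual MLM masking step (n=1 corresponds to fully randomized word order). The function name shuffle_ngrams and its seed parameter are illustrative, not from the paper.

```python
import random
from typing import Optional

def shuffle_ngrams(sentence: str, n: int = 1, seed: Optional[int] = None) -> str:
    """Randomly permute the non-overlapping n-grams of a whitespace-tokenized
    sentence. n=1 destroys word order entirely; larger n keeps local order
    inside each chunk while scrambling global order."""
    rng = random.Random(seed)
    tokens = sentence.split()
    # Split the token list into consecutive chunks of n tokens
    # (the final chunk may be shorter).
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng.shuffle(chunks)
    return " ".join(token for chunk in chunks for token in chunk)

# Example: corrupt a pre-training sentence before applying MLM masking.
# Prints one deterministic permutation of the bigram chunks.
print(shuffle_ngrams("the quick brown fox jumps over the lazy dog", n=2, seed=0))
```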

Cited by 106 publications (98 citation statements)
References 62 publications

“…In contrast, several works report that models fine-tuned on such perturbed data still produce high-confidence predictions and perform close to their counterparts on many tasks, including the GLUE benchmark (Ahmad et al., 2019; Sinha et al., 2020; Liu et al., 2021; Hessel and Schofield, 2021; Gupta et al., 2021). Similar results are demonstrated by the RoBERTa model (Liu et al., 2019b) when the word order perturbations are incorporated into the pre-training objective (Panda et al., 2021) or tested as a part of full pre-training on the perturbed corpora (Sinha et al., 2021). Sinha et al. (2021) find that the randomized RoBERTa models are similar to their naturally pre-trained peer according to parametric probes but perform worse according to the non-parametric ones.…”
Section: Related Work (supporting)
confidence: 64%
“…Some studies show that shuffling word order causes significant performance drops on a wide range of QA tasks (Sugawara et al., 2020). However, a number of works demonstrate that such permutation has little to no impact during the pre-training and fine-tuning stages (Pham et al., 2020; Sinha et al., 2020, 2021; O'Connor and Andreas, 2021; Hessel and Schofield, 2021; Gupta et al., 2021). The latter contradicts the common understanding of how hierarchical and structural information is encoded in LMs (Rogers et al., 2020), and may even call into question whether word order is modeled by the position embeddings (Dufter et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…The Transformer architecture (Vaswani et al., 2017) became the backbone of state-of-the-art models in a variety of tasks (Raffel et al., 2019; Adiwardana et al., 2020; Brown et al., 2020). This spurred significant interest in better understanding the inner workings of these models (Vig and Belinkov, 2019; Clark et al., 2019; Kharitonov and Chaabouni, 2020; Hahn, 2020; Movva and Zhao, 2020; Chaabouni et al., 2021; Merrill et al., 2021; Sinha et al., 2021). Most of these works have focused specifically on how models generalize and capture structure across samples that are similar.…”
Section: Introduction (mentioning)
confidence: 99%
“…They show that models are insensitive to word reorderings, some of which can actually result in improved task performance. Perhaps most strikingly, Sinha et al. (2021) show that pre-training full-scale RoBERTa models on perturbed sentences (across n-grams of varying lengths) and fine-tuning them on unaltered GLUE tasks leads to negligible performance loss. They also report that a popular probe for dependency structure, that of Pimentel et al. (2020), is able to decode trees from the perturbed representations with considerable accuracy, even from a unigram baseline with resampled words.…”
Section: NLU Evaluation (mentioning)
confidence: 99%