2021
DOI: 10.48550/arxiv.2104.06644
Preprint

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Abstract: A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high a…
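As a concrete illustration of the perturbation described in the abstract, the sketch below shuffles the word order of a sentence. It is a minimal, hypothetical example: the function name, whitespace tokenization, and seeding are placeholder choices and do not reproduce the paper's actual preprocessing pipeline.

```python
import random

def shuffle_word_order(sentence, seed=None):
    """Return the sentence with its tokens in a random order.

    Minimal sketch of the word-order shuffling the abstract describes;
    the paper's real preprocessing (tokenizer, corpus handling, controlled
    shuffling variants) is not modeled here.
    """
    rng = random.Random(seed)
    tokens = sentence.split()   # naive whitespace tokenization
    rng.shuffle(tokens)         # destroy word-order information in place
    return " ".join(tokens)

print(shuffle_word_order("the quick brown fox jumps over the lazy dog", seed=0))
# prints the same nine tokens in a scrambled order
```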

Cited by 29 publications (43 citation statements)
References 44 publications (82 reference statements)
“…Another interesting finding is that the word order does not have a great effect on TNN models (decreases smaller than 0.01). This is in line with recent research that indicates that the word order might not be as important as initially thought for transformer models [45,51].…”
Section: Robustness To Query Variations (supporting)
confidence: 92%
“…This suggests the power of simple n-gram models may have been underestimated previously, as they are typically trained from scratch, without modern techniques such as pre-training and knowledge distillation. This also echoes a series of recent work that questions the necessity of word order information (Sinha et al., 2021) and self-attention (You et al., 2020). We provide more details and list the inference speed for IMDB and SST-2 in Table 3. We have previously visualized the speed comparison on the IMDB dataset in Fig.…”
Section: Results (supporting)
confidence: 65%
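For readers unfamiliar with the baseline the statement above alludes to, here is a hedged sketch of a "simple n-gram model": bag-of-n-gram counts fed to a linear classifier. The toy data and scikit-learn defaults are assumptions for illustration only; this is not the cited papers' architecture or training setup, and it deliberately omits pre-training and knowledge distillation.

```python
# Illustrative bag-of-n-grams baseline; toy data, not the cited papers' setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a gripping , well acted film", "dull and far too long"]  # toy examples
labels = [1, 0]                                                     # 1 = positive sentiment

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),    # unigram and bigram counts
    LogisticRegression(max_iter=1000),      # linear classifier on top
)
model.fit(texts, labels)
print(model.predict(["well acted but far too long"]))
```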
“…One weakness of DANs is that they are restricted in modeling high-level meanings in long-range contexts, as compared to the self-attention operator in Transformers. However, recent studies have shown that large pre-trained Transformers are rather insensitive to word order (Sinha et al., 2021) and that they still work well when the learned self-attention is replaced with hard-coded localized attention (You et al., 2020). Taken together, these studies suggest that on some tasks it may be possible to get competitive results without computationally expensive operations such as self-attention.…”
Section: Introduction (mentioning)
confidence: 99%
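Since the statement above contrasts DANs with self-attention, the following sketch shows the basic shape of a Deep Averaging Network: token embeddings are averaged into a single vector and passed through a small feed-forward stack, with no attention at all. Vocabulary size, dimensions, and depth are placeholder values, not those of the cited work.

```python
# Minimal Deep Averaging Network (DAN) sketch with placeholder hyperparameters.
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        # EmbeddingBag with mode="mean" averages the token embeddings per example
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer token indices
        return self.ff(self.embed(token_ids))

model = DAN(vocab_size=30000, embed_dim=128, hidden_dim=256, num_classes=2)
logits = model(torch.randint(0, 30000, (4, 16)))  # 4 sequences of length 16
print(logits.shape)                               # torch.Size([4, 2])
```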
“…This work follows a new experimental direction that employs text perturbations in order to explore the sensitivity of language models to specific phenomena (Futrell et al., 2019; Ettinger, 2020; Taktasheva et al., 2021). It has been shown, for example, that shuffling word order causes significant performance drops on a wide range of QA tasks (Si et al., 2019; Sugawara et al., 2019), but that state-of-the-art NLU models are not sensitive to word order (Pham et al., 2020; Sinha et al., 2021). We add to this line of research by applying data corruption transformations that involve removing entire word classes (Talman et al., 2021) to all but one of the GLUE tasks.…”
Section: Related Work (mentioning)
confidence: 99%
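As a rough sketch of one of the corruption transformations mentioned above (removing an entire word class), the snippet below deletes a hand-written set of function words from a sentence. The word list and function name are illustrative assumptions; the cited work defines word classes via part-of-speech information rather than a fixed list.

```python
# Illustrative word-class removal; the set below is a stand-in for a proper
# POS-based definition of a word class.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "but"}

def remove_word_class(text, word_class=frozenset(FUNCTION_WORDS)):
    """Return the text with every token belonging to `word_class` removed."""
    kept = [tok for tok in text.split() if tok.lower() not in word_class]
    return " ".join(kept)

print(remove_word_class("the quick brown fox jumps over the lazy dog"))
# -> "quick brown fox jumps over lazy dog"
```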