2021
DOI: 10.48550/arxiv.2104.06644
Preprint

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Abstract: A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high a…
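As a concrete illustration of the perturbation described in the abstract, the sketch below shuffles the word order of a sentence. It is a minimal, hypothetical example: the function name, whitespace tokenization, and seeding are placeholder choices and do not reproduce the paper's actual preprocessing pipeline.

```python
import random

def shuffle_word_order(sentence, seed=None):
    """Return the sentence with its tokens in a random order.

    Minimal sketch of the word-order shuffling the abstract describes;
    the paper's real preprocessing (tokenizer, corpus handling, controlled
    shuffling variants) is not modeled here.
    """
    rng = random.Random(seed)
    tokens = sentence.split()   # naive whitespace tokenization
    rng.shuffle(tokens)         # destroy word-order information in place
    return " ".join(tokens)

print(shuffle_word_order("the quick brown fox jumps over the lazy dog", seed=0))
# prints the same nine tokens in a scrambled order
```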

Cited by 29 publications (43 citation statements)
References 44 publications (82 reference statements)
“…Another interesting finding is that the word order does not have a great effect on TNN models (decreases smaller than 0.01). This is in line with recent research that indicates that the word order might not be as important as initially thought for transformer models [45,51].…”
Section: Robustness To Query Variations (supporting)
confidence: 92%
“…This suggests the power of simple n-gram models may have been underestimated previously, as they are typically trained from scratch, without modern techniques such as pre-training and knowledge distillation. This also echoes a series of recent work that questions the necessity of word order information (Sinha et al., 2021) and self-attention (You et al., 2020). We provide more details and list the inference speed for IMDB and SST-2 in Table 3. We have previously visualized the speed comparison on the IMDB dataset in Fig.…”
Section: Results (supporting)
confidence: 65%
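For readers unfamiliar with the baseline the statement above alludes to, here is a hedged sketch of a "simple n-gram model": bag-of-n-gram counts fed to a linear classifier. The toy data and scikit-learn defaults are assumptions for illustration only; this is not the cited papers' architecture or training setup, and it deliberately omits pre-training and knowledge distillation.

```python
# Illustrative bag-of-n-grams baseline; toy data, not the cited papers' setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a gripping , well acted film", "dull and far too long"]  # toy examples
labels = [1, 0]                                                     # 1 = positive sentiment

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),    # unigram and bigram counts
    LogisticRegression(max_iter=1000),      # linear classifier on top
)
model.fit(texts, labels)
print(model.predict(["well acted but far too long"]))
```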
“…One weakness of DANs is that they are restricted in modeling high-level meanings in long-range contexts, as compared to the self-attention operator in Transformers. However, recent studies have shown that large pre-trained Transformers are rather insensitive to word order (Sinha et al., 2021) and that they still work well when the learned self-attention is replaced with hard-coded localized attention (You et al., 2020). Taken together, these studies suggest that on some tasks it may be possible to get competitive results without computationally expensive operations such as self-attention.…”
Section: Introduction (mentioning)
confidence: 99%
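Since the statement above contrasts DANs with self-attention, the following sketch shows the basic shape of a Deep Averaging Network: token embeddings are averaged into a single vector and passed through a small feed-forward stack, with no attention at all. Vocabulary size, dimensions, and depth are placeholder values, not those of the cited work.

```python
# Minimal Deep Averaging Network (DAN) sketch with placeholder hyperparameters.
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        # EmbeddingBag with mode="mean" averages the token embeddings per example
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer token indices
        return self.ff(self.embed(token_ids))

model = DAN(vocab_size=30000, embed_dim=128, hidden_dim=256, num_classes=2)
logits = model(torch.randint(0, 30000, (4, 16)))  # 4 sequences of length 16
print(logits.shape)                               # torch.Size([4, 2])
```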
“…This work follows a new experimental direction that employs text perturbations in order to explore the sensitivity of language models to specific phenomena (Futrell et al., 2019; Ettinger, 2020; Taktasheva et al., 2021). It has been shown, for example, that shuffling word order causes significant performance drops on a wide range of QA tasks (Si et al., 2019; Sugawara et al., 2019), but that state-of-the-art NLU models are not sensitive to word order (Pham et al., 2020; Sinha et al., 2021). We add to this line of research by applying data corruption transformations that involve removing entire word classes (Talman et al., 2021) to all but one of the GLUE tasks.…”
Section: Related Work (mentioning)
confidence: 99%
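As a rough sketch of one of the corruption transformations mentioned above (removing an entire word class), the snippet below deletes a hand-written set of function words from a sentence. The word list and function name are illustrative assumptions; the cited work defines word classes via part-of-speech information rather than a fixed list.

```python
# Illustrative word-class removal; the set below is a stand-in for a proper
# POS-based definition of a word class.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "but"}

def remove_word_class(text, word_class=frozenset(FUNCTION_WORDS)):
    """Return the text with every token belonging to `word_class` removed."""
    kept = [tok for tok in text.split() if tok.lower() not in word_class]
    return " ".join(kept)

print(remove_word_class("the quick brown fox jumps over the lazy dog"))
# -> "quick brown fox jumps over lazy dog"
```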