Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.236

Contextual Embeddings: When Are They Worth It?

Abstract: We study the settings for which deep contextual embeddings (e.g., BERT) give large improvements in performance relative to classic pretrained embeddings (e.g., GloVe), and an even simpler baseline, random word embeddings, focusing on the impact of the training set size and the linguistic properties of the task. Surprisingly, we find that both of these simpler baselines can match contextual embeddings on industry-scale data, and often perform within 5 to 10% accuracy (absolute) on benchmark tasks. Furthermore, we…
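A minimal sketch of the kind of comparison the abstract describes: pitting frozen random word embeddings against pretrained GloVe vectors as text-classification baselines while varying the training set size. This is illustrative only, not the authors' code; the GloVe file path, the averaged-embedding logistic-regression classifier, and the data splits are all assumptions.

```python
# Illustrative sketch (not the paper's implementation): compare a random-embedding
# baseline to pretrained GloVe vectors while varying the number of training examples.
# Assumes scikit-learn is installed and that "glove.6B.100d.txt" (hypothetical local
# path) contains pretrained GloVe vectors, one word per line.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

DIM = 100
rng = np.random.default_rng(0)

def load_glove(path="glove.6B.100d.txt"):
    """Read GloVe vectors into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def random_lookup(texts):
    """Random baseline: one fixed random vector per vocabulary word, never trained."""
    vocab = {tok for text in texts for tok in text.lower().split()}
    return {w: rng.normal(scale=0.1, size=DIM).astype(np.float32) for w in vocab}

def embed(texts, lookup):
    """Represent each text as the average of its token vectors (zeros for unknowns)."""
    out = []
    for text in texts:
        toks = text.lower().split()
        vecs = [lookup.get(t, np.zeros(DIM, dtype=np.float32)) for t in toks]
        out.append(np.mean(vecs, axis=0) if vecs else np.zeros(DIM, dtype=np.float32))
    return np.stack(out)

def evaluate(train_texts, train_y, test_texts, test_y, lookup):
    """Fit a simple classifier on averaged embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_texts, lookup), train_y)
    return accuracy_score(test_y, clf.predict(embed(test_texts, lookup)))

# Usage sketch: train_texts/train_y/test_texts/test_y stand in for a real benchmark split.
# for n in (100, 1000, 10000):
#     print(n, "random:", evaluate(train_texts[:n], train_y[:n], test_texts, test_y,
#                                   random_lookup(train_texts[:n])))
#     print(n, "glove: ", evaluate(train_texts[:n], train_y[:n], test_texts, test_y,
#                                   load_glove()))
```

The paper's finding is about the gap between such non-contextual baselines and contextual encoders like BERT shrinking as the training set grows; a full reproduction would swap in the paper's actual tasks and model architectures.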

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

2
32
0

Year Published

2020
2020
2022
2022

Publication Types

Select...
5
4
1

Relationship

0
10

Authors

Journals

Cited by 48 publications (37 citation statements). References 10 publications.
“…The key distinguishing factor between these models is the scale of unsupervised pretraining; therefore, we hypothesize that pretraining provides the same benefits targeted by common augmentation techniques. Arora et al. (2020) […] the LSTM requires augmentation to classify correctly, but RoBERTa does not; we observe rare word choice, atypical sentence structure, and generally off-beat reviews. This set contains reviews such as "suffers from over-familiarity since hit-hungry british filmmakers have strip-mined the monty formula mercilessly since 1997", "wishy-washy", or "wanker goths are on the loose!…”
Section: Why Can Data Augmentation Be Ineffective? (mentioning)
confidence: 81%
“…Our baseline results show that contextual embeddings outperform the non-contextual methods across all tasks. Arora et al. (2020) also compared randomly initialized, GloVe, and BERT embeddings and found that with smaller training sets, the difference in performance between these three embedding types is larger. This is in accordance with our results, which show that the type of embedding has a large impact on the baseline performance on all three tasks.…”
Section: Discussion (mentioning)
confidence: 99%
“…Classification tasks that require capturing general lexical semantics can be successfully distilled by very simple and efficient models; however, classification tasks that require detecting linguistic structure and contextual relations are more challenging for distillation using simple student models. For future work, we aim to explore the impact of the datasets' linguistic structures on distillation success and to develop dataset-related measurements (Arora et al., 2020) for predicting the success of distillation in relation to different student models.…”
Section: Discussion (mentioning)
confidence: 99%