Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.236

Contextual Embeddings: When Are They Worth It?

Abstract: We study the settings for which deep contextual embeddings (e.g., BERT) give large improvements in performance relative to classic pretrained embeddings (e.g., GloVe), and an even simpler baseline, random word embeddings, focusing on the impact of the training set size and the linguistic properties of the task. Surprisingly, we find that both of these simpler baselines can match contextual embeddings on industry-scale data, and often perform within 5 to 10% accuracy (absolute) on benchmark tasks. Furthermore, we…
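A minimal sketch of the kind of comparison the abstract describes: pitting frozen random word embeddings against pretrained GloVe vectors as text-classification baselines while varying the training set size. This is illustrative only, not the authors' code; the GloVe file path, the averaged-embedding logistic-regression classifier, and the data splits are all assumptions.

```python
# Illustrative sketch (not the paper's implementation): compare a random-embedding
# baseline to pretrained GloVe vectors while varying the number of training examples.
# Assumes scikit-learn is installed and that "glove.6B.100d.txt" (hypothetical local
# path) contains pretrained GloVe vectors, one word per line.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

DIM = 100
rng = np.random.default_rng(0)

def load_glove(path="glove.6B.100d.txt"):
    """Read GloVe vectors into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def random_lookup(texts):
    """Random baseline: one fixed random vector per vocabulary word, never trained."""
    vocab = {tok for text in texts for tok in text.lower().split()}
    return {w: rng.normal(scale=0.1, size=DIM).astype(np.float32) for w in vocab}

def embed(texts, lookup):
    """Represent each text as the average of its token vectors (zeros for unknowns)."""
    out = []
    for text in texts:
        toks = text.lower().split()
        vecs = [lookup.get(t, np.zeros(DIM, dtype=np.float32)) for t in toks]
        out.append(np.mean(vecs, axis=0) if vecs else np.zeros(DIM, dtype=np.float32))
    return np.stack(out)

def evaluate(train_texts, train_y, test_texts, test_y, lookup):
    """Fit a simple classifier on averaged embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_texts, lookup), train_y)
    return accuracy_score(test_y, clf.predict(embed(test_texts, lookup)))

# Usage sketch: train_texts/train_y/test_texts/test_y stand in for a real benchmark split.
# for n in (100, 1000, 10000):
#     print(n, "random:", evaluate(train_texts[:n], train_y[:n], test_texts, test_y,
#                                   random_lookup(train_texts[:n])))
#     print(n, "glove: ", evaluate(train_texts[:n], train_y[:n], test_texts, test_y,
#                                   load_glove()))
```

The paper's finding is about the gap between such non-contextual baselines and contextual encoders like BERT shrinking as the training set grows; a full reproduction would swap in the paper's actual tasks and model architectures.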

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

2
32
0

Year Published

2020
2020
2022
2022

Publication Types

Select...
5
4
1

Relationship

0
10

Authors

Journals

Cited by 48 publications (37 citation statements). References 10 publications.
“…The key distinguishing factor between these models is the scale of unsupervised pretraining; therefore, we hypothesize that pretraining provides the same benefits targeted by common augmentation techniques. Arora et al. (2020) […] the LSTM requires augmentation to classify correctly, but RoBERTa does not; we observe rare word choice, atypical sentence structure, and generally off-beat reviews. This set contains reviews such as "suffers from over-familiarity since hit-hungry british filmmakers have strip-mined the monty formula mercilessly since 1997", "wishy-washy", or "wanker goths are on the loose!…”
Section: Why Can Data Augmentation Be Ineffective? (mentioning)
confidence: 81%
“…Our baseline results show that contextual embeddings outperform the non-contextual methods across all tasks. Arora et al. (2020) also compared randomly initialized, GloVe, and BERT embeddings and found that with smaller training sets, the difference in performance between these three embedding types is larger. This is in accordance with our results, which show that the type of embedding has a large impact on the baseline performance on all three tasks.…”
Section: Discussion (mentioning)
confidence: 99%
“…Classification tasks that require capturing general lexical semantics can be successfully distilled by very simple and efficient models; however, classification tasks that require detecting linguistic structure and contextual relations are more challenging for distillation using simple student models. For future work, we aim to explore the impact of the datasets' linguistic structures on distillation success and to develop dataset-related measurements (Arora et al., 2020) for predicting the success of distillation in relation to different student models.…”
Section: Discussion (mentioning)
confidence: 99%