Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.368

BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

Abstract: Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are sui…
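
The abstract's core idea, making an inferred embedding for a rare word usable by a pretrained model, can be illustrated with a short sketch. This is not the BERTRAM architecture itself: the HuggingFace `transformers` calls are real, but the rare word and the randomly generated "inferred" vector are placeholders for whatever an inference model such as BERTRAM would produce.

```python
# Minimal sketch (not BERTRAM itself): inject an externally inferred embedding
# for a rare word into a pretrained BERT's input embedding matrix.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

rare_word = "wampumpeag"                          # hypothetical rare word
inferred = torch.randn(model.config.hidden_size)  # placeholder for an inferred vector

# Register the rare word as a single token and grow the embedding matrix.
tokenizer.add_tokens([rare_word])
model.resize_token_embeddings(len(tokenizer))

# Overwrite the newly added row with the inferred embedding.
with torch.no_grad():
    new_id = tokenizer.convert_tokens_to_ids(rare_word)
    model.get_input_embeddings().weight[new_id] = inferred
```

After this step the model can process inputs containing the rare word as a single token with a dedicated embedding, rather than as a sequence of wordpieces.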

Cited by 34 publications (31 citation statements)
References 26 publications (61 reference statements)
“…E-BERT-concat. E-BERT-concat combines entity IDs and wordpieces by string concatenation, with the slash symbol as separator (Schick and Schütze, 2019). For example, the wordpiece-tokenized input…”
Section: Using Aligned Entity Vectors
confidence: 99%
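
A minimal sketch of the concatenation scheme described in the quote above: an entity ID is joined with the wordpiece-tokenized surface form, using "/" as the separator. The entity ID `Jean_Marais`, the surface form, and the helper function are hypothetical illustrations, not taken from the cited paper.

```python
# Sketch of an E-BERT-concat style input: entity ID + "/" + wordpieces.
# The concrete entity and surface form below are made-up examples.
from transformers import BertTokenizer

def concat_entity_and_wordpieces(entity_id, surface_form, tokenizer):
    """Return [entity_id, "/", wordpiece_1, ..., wordpiece_n]."""
    return [entity_id, "/"] + tokenizer.tokenize(surface_form)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = concat_entity_and_wordpieces("Jean_Marais", "Jean Marais", tokenizer)
print(tokens)  # ['Jean_Marais', '/', <wordpieces of "Jean Marais">]
```
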
“…The only exception is RoBERTa-large, which had the lowest initial isotropy. Interestingly, Schick and Schütze (2020a) show that RoBERTa-large outperforms BERT models on tasks designed explicitly for rare words. Moreover, according to common leaderboards (Wang et al., 2019b,a), RoBERTa performs best on downstream tasks among the models we analyzed.…”
Section: Results
confidence: 99%
“…Performance of pretrained language models is inconsistent and tends to decrease when the input contains rare words (Schick and Schütze, 2020b,a). Schick and Schütze (2020a) observe that replacing a portion of words in the MNLI (Williams et al., 2018) entailment data set with less frequent synonyms leads to a decrease in the performance of BERT-base and RoBERTa-large by 30% and 21.8%, respectively. After enriching rare words with surface-form features and additional context, Schick and Schütze (2020a) decrease the performance gap to 20.7% for BERT and 17% for RoBERTa, but the gap remains large nonetheless.…”
Section: Introduction
confidence: 95%
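
To make the perturbation described above concrete, here is a toy sketch that replaces a word with a less frequent WordNet synonym. The frequency table is a made-up stand-in for real corpus counts, and the cited work's actual substitution procedure may differ.

```python
# Toy sketch: swap a word for a rarer synonym (requires nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

# Made-up frequency table standing in for corpus counts.
FREQ = {"big": 100_000, "large": 80_000, "voluminous": 120, "sizeable": 900}

def rarer_synonym(word):
    """Return the least frequent known WordNet synonym that is rarer than `word`."""
    candidates = {
        lemma.name().replace("_", " ")
        for synset in wn.synsets(word)
        for lemma in synset.lemmas()
    }
    rarer = [w for w in candidates
             if w != word and w in FREQ and FREQ[w] < FREQ.get(word, 0)]
    return min(rarer, key=FREQ.get, default=word)

print(rarer_synonym("big"))  # 'large' under the toy frequency table above
```
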
“…Handling Rare Words. These remain challenging even for large transformer models (Schick and Schütze, 2020). Recent work has explored copying mechanisms and character-based generation (Kawakami et al., 2017), with some success.…”
Section: Related Work
confidence: 99%