Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
DOI: 10.18653/v1/2020.emnlp-main.733

On the Sentence Embeddings from Pre-trained Language Models

Abstract: Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, sentence embeddings from pre-trained language models without fine-tuning have been found to capture the semantic meaning of sentences poorly. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, an…
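For concreteness, the setup the abstract refers to can be reproduced in a few lines: a minimal sketch, assuming the HuggingFace transformers library and PyTorch, that pools token embeddings from a pre-trained (not fine-tuned) BERT into sentence vectors and compares them with cosine similarity. The model name and mean pooling are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch: sentence embeddings from a pre-trained BERT without
# fine-tuning, compared with cosine similarity. Assumes the HuggingFace
# `transformers` library and PyTorch; pooling choice is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last-layer token embeddings into one sentence vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

a = embed("A man is playing a guitar.")
b = embed("Someone plays an instrument.")
cosine = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"cosine similarity: {cosine.item():.3f}")
```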

Citations: cited by 274 publications (203 citation statements)
References: 18 publications (26 reference statements)
“…They relate this observation to the Zipfian nature of word distributions, where the vast majority of words are infrequent. Li et al. (2020a) extend this insight specifically to BERT and show that, while high-frequency words concentrate densely, low-frequency words are much more sparsely distributed. Though we do not attempt to dispute these claims with our findings, we do hope our experiments will highlight the important role that positional embeddings play in the representational geometry of Transformer-based models.…”
Section: Discussion (mentioning)
confidence: 67%
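As a rough, hypothetical probe of the frequency claim above, the sketch below (assuming the HuggingFace transformers library; the hand-picked frequent and rare word lists are illustrative) compares average nearest-neighbor distances of BERT embeddings for the two groups. It illustrates the kind of measurement involved, not the cited authors' procedure.

```python
# Hypothetical probe: do embeddings of frequent words sit closer to their
# nearest neighbors than embeddings of rare words? Word lists are arbitrary
# examples chosen for illustration only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embeddings(words):
    vecs = []
    for w in words:
        inputs = tokenizer(w, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs).last_hidden_state
        vecs.append(out[0, 1:-1].mean(dim=0))  # average subwords, drop [CLS]/[SEP]
    return torch.stack(vecs)

def mean_nn_distance(vecs):
    d = torch.cdist(vecs, vecs)
    d.fill_diagonal_(float("inf"))          # ignore self-distances
    return d.min(dim=1).values.mean().item()

frequent = ["the", "and", "of", "to", "in"]
rare = ["serendipity", "obfuscate", "quixotic", "penumbra", "lachrymose"]
print("frequent:", mean_nn_distance(word_embeddings(frequent)))
print("rare:    ", mean_nn_distance(word_embeddings(rare)))
```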
“…When cosine similarity is viewed primarily as a means of semantic comparison between word or sentence vectors, the prospect of calculating cosine similarity for a benchmark like WiC or STS-B becomes erroneous. Though an examination of distance metrics is outside the scope of this study, we acknowledge that similar points have been addressed with regard to static word embeddings (Mimno and Thompson, 2017) as well as contextualized ones (Li et al., 2020b). Likewise, we would like to stress that our manual clipping operation was performed for illustrative purposes and that interested researchers should employ more systematic post-hoc normalization strategies, e.g.…”
Section: Discussion (mentioning)
confidence: 96%
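The "more systematic post-hoc normalization strategies" the authors gesture at are not specified in the excerpt; one common option is mean-centering plus PCA whitening of the embedding matrix before cosine comparison. The sketch below, using NumPy, illustrates that option only and should not be read as the cited papers' method.

```python
# Illustrative sketch of one post-hoc normalization strategy (mean-centering
# plus PCA whitening) applied to a matrix of embeddings before cosine
# comparison. An example of the idea, not the cited authors' procedure.
import numpy as np

def whiten(embeddings: np.ndarray) -> np.ndarray:
    """Center the embeddings and decorrelate/rescale their dimensions."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-6))  # epsilon for stability
    return centered @ w

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 768))   # stand-in for contextual embeddings
emb_w = whiten(emb)
print("raw:     ", cosine(emb[0], emb[1]))
print("whitened:", cosine(emb_w[0], emb_w[1]))
```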
“…However, the sentence or document embeddings derived from such an MLM without fine-tuning on in-domain data are shown to be inferior in terms of the ability to capture semantic information that can be used in similarity-related tasks (Reimers and Gurevych, 2019). Instead of using the [CLS] vector to obtain sentence embeddings, in this paper we take the average of context embeddings from the last two layers, as these are shown to be consistently better than the [CLS] vector (Reimers and Gurevych, 2019; Li et al., 2020).…”
Section: Domain Adaptation Using Contextual Semantic Embeddings (mentioning)
confidence: 99%
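The pooling described in the excerpt above (averaging context embeddings from the last two layers instead of taking the [CLS] vector) can be sketched as follows, assuming the HuggingFace transformers library; the model name is an illustrative choice.

```python
# Sketch of the pooling described above: average token embeddings from the
# last two Transformer layers rather than taking the [CLS] vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def last_two_layer_avg(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + 12 layers
    # Average the last two layers, then mean-pool over tokens.
    last_two = torch.stack(hidden_states[-2:]).mean(dim=0)  # (1, seq_len, hidden)
    return last_two.mean(dim=1).squeeze(0)                  # (hidden,)

sentence_vec = last_two_layer_avg("Pre-trained language models are useful.")
print(sentence_vec.shape)  # torch.Size([768])
```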
“…Recent years have seen the huge success of pre-trained language models across a wide range of NLP tasks (Devlin et al., 2019; Lewis et al., 2020). However, several studies (Reimers and Gurevych, 2019; Li et al., 2020) have found that sentence embeddings from pre-trained language models perform poorly on semantic similarity tasks when the models are not fine-tuned on task-specific data. Meanwhile, Goldberg (2019) shows that BERT without fine-tuning performs surprisingly well on syntactic tasks.…”
Section: Introduction (mentioning)
confidence: 99%