Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
DOI: 10.18653/v1/2020.emnlp-main.733

On the Sentence Embeddings from Pre-trained Language Models

Abstract: Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, sentence embeddings from pre-trained language models without fine-tuning have been found to capture the semantic meaning of sentences poorly. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, an…
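For concreteness, the setup the abstract refers to can be reproduced in a few lines: a minimal sketch, assuming the HuggingFace transformers library and PyTorch, that pools token embeddings from a pre-trained (not fine-tuned) BERT into sentence vectors and compares them with cosine similarity. The model name and mean pooling are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch: sentence embeddings from a pre-trained BERT without
# fine-tuning, compared with cosine similarity. Assumes the HuggingFace
# `transformers` library and PyTorch; pooling choice is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last-layer token embeddings into one sentence vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

a = embed("A man is playing a guitar.")
b = embed("Someone plays an instrument.")
cosine = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"cosine similarity: {cosine.item():.3f}")
```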

Citations: cited by 274 publications (203 citation statements)
References: 18 publications (26 reference statements)
“…They relate this observation to the Zipfian nature of word distributions, where the vast majority of words are infrequent. Li et al. (2020a) extend this insight specifically to BERT and show that, while high-frequency words concentrate densely, low-frequency words are much more sparsely distributed. Though we do not attempt to dispute these claims with our findings, we do hope our experiments will highlight the important role that positional embeddings play in the representational geometry of Transformer-based models.…”
Section: Discussion (mentioning)
confidence: 67%
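As a rough, hypothetical probe of the frequency claim above, the sketch below (assuming the HuggingFace transformers library; the hand-picked frequent and rare word lists are illustrative) compares average nearest-neighbor distances of BERT embeddings for the two groups. It illustrates the kind of measurement involved, not the cited authors' procedure.

```python
# Hypothetical probe: do embeddings of frequent words sit closer to their
# nearest neighbors than embeddings of rare words? Word lists are arbitrary
# examples chosen for illustration only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embeddings(words):
    vecs = []
    for w in words:
        inputs = tokenizer(w, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs).last_hidden_state
        vecs.append(out[0, 1:-1].mean(dim=0))  # average subwords, drop [CLS]/[SEP]
    return torch.stack(vecs)

def mean_nn_distance(vecs):
    d = torch.cdist(vecs, vecs)
    d.fill_diagonal_(float("inf"))          # ignore self-distances
    return d.min(dim=1).values.mean().item()

frequent = ["the", "and", "of", "to", "in"]
rare = ["serendipity", "obfuscate", "quixotic", "penumbra", "lachrymose"]
print("frequent:", mean_nn_distance(word_embeddings(frequent)))
print("rare:    ", mean_nn_distance(word_embeddings(rare)))
```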
“…When cosine similarity is viewed primarily as a means of semantic comparison between word or sentence vectors, the prospect of calculating cosine similarity for a benchmark like WiC or STS-B becomes erroneous. Though an examination of distance metrics is outside the scope of this study, we acknowledge that similar points have been addressed with regard to static word embeddings (Mimno and Thompson, 2017) as well as contextualized ones (Li et al., 2020b). Likewise, we would like to stress that our manual clipping operation was performed for illustrative purposes and that interested researchers should employ more systematic post-hoc normalization strategies, e.g.…”
Section: Discussion (mentioning)
confidence: 96%
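The "more systematic post-hoc normalization strategies" the authors gesture at are not specified in the excerpt; one common option is mean-centering plus PCA whitening of the embedding matrix before cosine comparison. The sketch below, using NumPy, illustrates that option only and should not be read as the cited papers' method.

```python
# Illustrative sketch of one post-hoc normalization strategy (mean-centering
# plus PCA whitening) applied to a matrix of embeddings before cosine
# comparison. An example of the idea, not the cited authors' procedure.
import numpy as np

def whiten(embeddings: np.ndarray) -> np.ndarray:
    """Center the embeddings and decorrelate/rescale their dimensions."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-6))  # epsilon for stability
    return centered @ w

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 768))   # stand-in for contextual embeddings
emb_w = whiten(emb)
print("raw:     ", cosine(emb[0], emb[1]))
print("whitened:", cosine(emb_w[0], emb_w[1]))
```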
“…However, the sentence or document embeddings derived from such an MLM without fine-tuning on in-domain data are shown to be inferior in terms of the ability to capture semantic information that can be used in similarity-related tasks (Reimers and Gurevych, 2019). Instead of using the [CLS] vector to obtain sentence embeddings, in this paper we take the average of context embeddings from the last two layers, as these are shown to be consistently better than the [CLS] vector (Reimers and Gurevych, 2019; Li et al., 2020).…”
Section: Domain Adaptation Using Contextual Semantic Embeddings (mentioning)
confidence: 99%
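The pooling described in the excerpt above (averaging context embeddings from the last two layers instead of taking the [CLS] vector) can be sketched as follows, assuming the HuggingFace transformers library; the model name is an illustrative choice.

```python
# Sketch of the pooling described above: average token embeddings from the
# last two Transformer layers rather than taking the [CLS] vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def last_two_layer_avg(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + 12 layers
    # Average the last two layers, then mean-pool over tokens.
    last_two = torch.stack(hidden_states[-2:]).mean(dim=0)  # (1, seq_len, hidden)
    return last_two.mean(dim=1).squeeze(0)                  # (hidden,)

sentence_vec = last_two_layer_avg("Pre-trained language models are useful.")
print(sentence_vec.shape)  # torch.Size([768])
```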
“…Recent years have seen the huge success of pre-trained language models across a wide range of NLP tasks (Devlin et al., 2019; Lewis et al., 2020). However, several studies (Reimers and Gurevych, 2019; Li et al., 2020) have found that sentence embeddings from pre-trained language models perform poorly on semantic similarity tasks when the models are not fine-tuned on task-specific data. Meanwhile, Goldberg (2019) shows that BERT without fine-tuning performs surprisingly well on syntactic tasks.…”
Section: Introduction (mentioning)
confidence: 99%