Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021
DOI: 10.18653/v1/2021.acl-short.73

A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space

Abstract: The representation degeneration problem in Contextual Word Representations (CWRs) hurts the expressiveness of the embedding space by forming an anisotropic cone where even unrelated words have excessively positive correlations. Existing techniques for tackling this issue require a learning process to re-train models with additional objectives and mostly employ a global assessment to study isotropy. Our quantitative analysis over isotropy shows that a local assessment could be more accurate due to the clustered…
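The "excessively positive correlations" the abstract mentions are easy to observe directly. The sketch below is not from the paper; the model choice and sentences are illustrative assumptions. It averages BERT's contextual token embeddings for a few unrelated sentences and reports their mean pairwise cosine similarity, which in an isotropic space would be near zero:

```python
# A minimal sketch (assumed setup, not the paper's code) showing the
# anisotropy of contextual embedding spaces: unrelated sentences still
# get embeddings with high cosine similarity.
import torch
from itertools import combinations
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "The stock market fell sharply today.",
    "She planted tulips in the garden.",
    "Quantum computers use qubits.",
]

with torch.no_grad():
    embs = []
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt")
        out = model(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
        embs.append(out[1:-1].mean(dim=0))        # drop [CLS]/[SEP], average tokens

# For an isotropic space this average would be close to 0.
sims = [torch.cosine_similarity(a, b, dim=0).item()
        for a, b in combinations(embs, 2)]
print(f"mean cosine similarity of unrelated sentences: {sum(sims) / len(sims):.3f}")
```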

Cited by 15 publications (19 citation statements) · References 17 publications (20 reference statements)
“…Cluster-based approach. Based on the clustered structure of pre-trained LMs (Reif et al., 2019), this method can significantly improve the performance of contextual embedding spaces as well as their isotropy (Rajaee and Pilehvar, 2021).…”
Section: Methods
confidence: 99%
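The quoted statement describes the cluster-based operation only at a high level, so the following is a minimal sketch of the idea as Rajaee and Pilehvar (2021) present it: cluster the contextual embeddings, then zero-center each cluster and remove its dominant principal directions (a local variant of Mu and Viswanath's "all-but-the-top"). The cluster count and number of removed directions below are illustrative assumptions, not the paper's tuned values:

```python
# A sketch of the cluster-based isotropy-enhancement idea: per-cluster
# mean removal plus removal of each cluster's dominant PCA directions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_based_isotropy(embeddings, n_clusters=27, n_dirs=12):
    """embeddings: (n_samples, dim) array of contextual representations."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    improved = np.empty_like(embeddings)
    for c in range(n_clusters):
        idx = labels == c
        cluster = embeddings[idx] - embeddings[idx].mean(axis=0)  # zero-center locally
        pca = PCA(n_components=min(n_dirs, cluster.shape[0])).fit(cluster)
        # Subtract each vector's projection onto the cluster's top directions
        # (a local "all-but-the-top").
        proj = cluster @ pca.components_.T @ pca.components_
        improved[idx] = cluster - proj
    return improved
```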
“…To answer these questions, we consider semantic textual similarity (STS) as the target task and leverage the metric proposed by Mu and Viswanath (2018) for measuring isotropy. The pre-trained BERT and RoBERTa (Liu et al., 2019b) underperform static embeddings on STS, while fine-tuning significantly boosts their performance, suggesting the considerable change that CWRs undergo during fine-tuning (Reimers and Gurevych, 2019; Rajaee and Pilehvar, 2021).…”
Section: Introduction
confidence: 99%
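For reference, the measure of Mu and Viswanath (2018) that this statement relies on is the ratio of the minimum to the maximum of the partition function Z(c) = Σ_w exp(c·w), evaluated over the eigenvectors of W^T W; values near 1 indicate a more isotropic space. A minimal NumPy sketch:

```python
# Isotropy score of Mu and Viswanath (2018): I(W) = min_c Z(c) / max_c Z(c),
# with Z(c) = sum_w exp(c . w) and c ranging over eigenvectors of W^T W.
import numpy as np

def isotropy_score(W):
    """W: (n_words, dim) embedding matrix. Returns I(W) in (0, 1]."""
    _, eigvecs = np.linalg.eigh(W.T @ W)   # eigenvectors of W^T W, as columns
    Z = np.exp(W @ eigvecs).sum(axis=0)    # partition function per eigenvector
    return Z.min() / Z.max()
```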
“…The same metric is used for measuring isotropy of contextual word representations by Rajaee and Pilehvar (2021). We randomly sample 10k sentences from English Wikipedia as V. We compute the average word-in-context embeddings for all words in each sentence and then compute the IS value. We repeat the process five times to reduce the randomness introduced in sampling.…”
confidence: 99%
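The procedure in this statement maps onto a short evaluation loop. The sketch below makes two assumptions: `embed_tokens` is a hypothetical helper that returns (word, vector) pairs of contextual embeddings for one sentence, and "average word-in-context embeddings for all words" is read as a per-word average across occurrences. It reuses `isotropy_score` from the sketch above, with `corpus` standing in for the sampled English Wikipedia sentences:

```python
# A sketch of the described evaluation loop: sample 10k sentences, average
# each word's in-context embeddings, compute the IS value, repeat 5 times.
from collections import defaultdict
import numpy as np

def average_is(corpus, embed_tokens, n_sentences=10_000, n_repeats=5, seed=0):
    """corpus: list of sentences; embed_tokens: hypothetical helper,
    embed_tokens(sentence) -> list of (word, vector) pairs."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):  # repeat to reduce sampling randomness
        sample = rng.choice(len(corpus), size=n_sentences, replace=False)
        by_word = defaultdict(list)
        for i in sample:
            for word, vec in embed_tokens(corpus[i]):
                by_word[word].append(vec)
        # Average each word's in-context embeddings, then score the space.
        W = np.stack([np.mean(vs, axis=0) for vs in by_word.values()])
        scores.append(isotropy_score(W))  # metric from the earlier sketch
    return float(np.mean(scores))
```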