Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.46

Revisiting Representation Degeneration Problem in Language Modeling

Abstract: Weight tying is now a common setting in many language generation tasks such as language modeling and machine translation. However, a recent study revealed a potential flaw in weight tying: the learned word embeddings are likely to degenerate and lie in a narrow cone when training a language model. The authors call this the representation degeneration problem and propose a cosine regularization to solve it. Nevertheless, we prove that the cosine regularization is insufficient to solve the problem…
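The abstract names but does not define the cosine regularization of Gao et al. (2019). Below is a minimal sketch of how such a penalty over the word-embedding matrix can be implemented, assuming PyTorch; `cosine_regularizer` and `reg_lambda` are illustrative names, not identifiers from the paper.

```python
import torch
import torch.nn.functional as F

def cosine_regularizer(embedding: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity over all distinct rows of an
    embedding matrix (vocab_size x dim). Adding this term to the
    training loss pushes word embeddings apart on the unit sphere,
    counteracting the narrow-cone degeneration described above."""
    normed = F.normalize(embedding, dim=1)         # unit-length rows
    sims = normed @ normed.t()                     # Gram matrix of cosine similarities
    n = normed.size(0)
    off_diag = sims.sum() - sims.diagonal().sum()  # drop self-similarity terms
    return off_diag / (n * (n - 1))

# Hypothetical usage in a language-model training step (reg_lambda is
# an illustrative hyperparameter, not a value from the paper):
# loss = cross_entropy_loss + reg_lambda * cosine_regularizer(model.embedding.weight)
```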

Cited by 10 publications (8 citation statements). References 17 publications.
“…Making the embedding space isotropic has theoretical and empirical benefits (Gao et al., 2019). Several approaches have been proposed to improve isotropy in monolingual CWRs (Wieting and Kiela, 2019; Li et al., 2020; Zhang et al., 2020). Most proposed approaches need re-training models with additional objectives to address the degeneration problem, which is a costly process.…”
Section: Methods
confidence: 99%
“…Previous research has shown that many pre-trained models, such as GPT-2 (Radford et al., 2019), ELMo (Peters et al., 2018), BERT, and RoBERTa, have degenerated embedding spaces that downgrade their semantic expressiveness (Ethayarajh, 2019; Cai et al., 2021; Rajaee and Pilehvar, 2021). Several proposals have been put forward to overcome this challenge (Gao et al., 2019; Zhang et al., 2020). However, to our knowledge, no study has so far been conducted on the degeneration problem in the multilingual embedding space.…”
Section: Introduction
confidence: 99%
“…However, for an image contrastive learning model like MoCo, experimental results suggest that a longer queue size increases performance. We believe this difference is due to the unique anisotropy of text (Zhang et al., 2020b). Text is influenced by word frequency, producing an anisotropic, unevenly distributed embedding space, which differs from the near-uniform distribution of pixel values in image data.…”
Section: Maximum Traceable Distance Metric
confidence: 96%
“…Following the discovery of anisotropy in transformers (Gao et al., 2019; Ethayarajh, 2019), different isotropy calibration methods have been developed to correct this phenomenon. Gao et al. (2019) and Zhang et al. (2020) introduced regularization objectives that affect the embedding distances. Zhou et al. (2021) presented a module inspired by batch-norm that regularizes the embeddings towards isotropic representations.…”
Section: Related Work
confidence: 99%
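For context on what these calibration methods target, here is a minimal sketch (assuming PyTorch) of estimating anisotropy as the expected cosine similarity between random embedding pairs, plus mean-centering as a crude post-hoc correction. This is an illustrative baseline only, not the regularizers of Gao et al. (2019) and Zhang et al. (2020) or the batch-norm-style module of Zhou et al. (2021); all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def anisotropy_estimate(embeddings: torch.Tensor, n_pairs: int = 10_000) -> torch.Tensor:
    """Expected cosine similarity between randomly drawn embedding pairs:
    values near 0 indicate an isotropic space, values near 1 a
    degenerate, cone-shaped one."""
    idx_a = torch.randint(0, embeddings.size(0), (n_pairs,))
    idx_b = torch.randint(0, embeddings.size(0), (n_pairs,))
    return F.cosine_similarity(embeddings[idx_a], embeddings[idx_b], dim=1).mean()

def center_embeddings(embeddings: torch.Tensor) -> torch.Tensor:
    """Subtract the mean embedding, removing the common direction that
    dominates a narrow cone; this typically lowers the estimate above."""
    return embeddings - embeddings.mean(dim=0, keepdim=True)
```

Mean-centering is only a first-order fix; the methods cited above instead learn the correction during training or apply it as a dedicated normalization module.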