Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.46

Revisiting Representation Degeneration Problem in Language Modeling

Abstract: Weight tying is now a common setting in many language generation tasks such as language modeling and machine translation. However, a recent study revealed a potential flaw in weight tying: the learned word embeddings are likely to degenerate and lie in a narrow cone when training a language model. The authors call this the representation degeneration problem and propose a cosine regularization to solve it. Nevertheless, we prove that the cosine regularization is insufficient to solve the problem…
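The abstract names but does not define the cosine regularization of Gao et al. (2019). Below is a minimal sketch of how such a penalty over the word-embedding matrix can be implemented, assuming PyTorch; `cosine_regularizer` and `reg_lambda` are illustrative names, not identifiers from the paper.

```python
import torch
import torch.nn.functional as F

def cosine_regularizer(embedding: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity over all distinct rows of an
    embedding matrix (vocab_size x dim). Adding this term to the
    training loss pushes word embeddings apart on the unit sphere,
    counteracting the narrow-cone degeneration described above."""
    normed = F.normalize(embedding, dim=1)         # unit-length rows
    sims = normed @ normed.t()                     # Gram matrix of cosine similarities
    n = normed.size(0)
    off_diag = sims.sum() - sims.diagonal().sum()  # drop self-similarity terms
    return off_diag / (n * (n - 1))

# Hypothetical usage in a language-model training step (reg_lambda is
# an illustrative hyperparameter, not a value from the paper):
# loss = cross_entropy_loss + reg_lambda * cosine_regularizer(model.embedding.weight)
```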

Cited by 10 publications (8 citation statements). References 17 publications.
“…Making the embedding space isotropic has theoretical and empirical benefits (Gao et al., 2019). Several approaches have been proposed to improve isotropy in monolingual CWRs (Wieting and Kiela, 2019; Li et al., 2020; Zhang et al., 2020). Most proposed approaches need re-training models with additional objectives to address the degeneration problem, which is a costly process.…”
Section: Methods
confidence: 99%
“…Previous research has shown that many pre-trained models, such as GPT-2 (Radford et al., 2019), ELMo (Peters et al., 2018), BERT, and RoBERTa, have degenerated embedding spaces that downgrade their semantic expressiveness (Ethayarajh, 2019; Cai et al., 2021; Rajaee and Pilehvar, 2021). Several proposals have been put forward to overcome this challenge (Gao et al., 2019; Zhang et al., 2020). However, to our knowledge, no study has so far been conducted on the degeneration problem in the multilingual embedding space.…”
Section: Introduction
confidence: 99%
“…However, for an image contrastive learning model like MoCo, experimental results suggest that a longer queue size increases performance. We believe this difference is due to the unique anisotropy of text (Zhang et al., 2020b). Text is influenced by word frequency, producing an anisotropic, unevenly distributed embedding space, which differs from the near-uniform distribution of pixel values in image data.…”
Section: Maximum Traceable Distance Metric
confidence: 96%
“…Following the discovery of anisotropy in transformers (Gao et al., 2019; Ethayarajh, 2019), different isotropy calibration methods have been developed to correct this phenomenon. Gao et al. (2019) and Zhang et al. (2020) introduced regularization objectives that affect the embedding distances. Zhou et al. (2021) presented a module inspired by batch-norm that regularizes the embeddings towards isotropic representations.…”
Section: Related Work
confidence: 99%
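For context on what these calibration methods target, here is a minimal sketch (assuming PyTorch) of estimating anisotropy as the expected cosine similarity between random embedding pairs, plus mean-centering as a crude post-hoc correction. This is an illustrative baseline only, not the regularizers of Gao et al. (2019) and Zhang et al. (2020) or the batch-norm-style module of Zhou et al. (2021); all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def anisotropy_estimate(embeddings: torch.Tensor, n_pairs: int = 10_000) -> torch.Tensor:
    """Expected cosine similarity between randomly drawn embedding pairs:
    values near 0 indicate an isotropic space, values near 1 a
    degenerate, cone-shaped one."""
    idx_a = torch.randint(0, embeddings.size(0), (n_pairs,))
    idx_b = torch.randint(0, embeddings.size(0), (n_pairs,))
    return F.cosine_similarity(embeddings[idx_a], embeddings[idx_b], dim=1).mean()

def center_embeddings(embeddings: torch.Tensor) -> torch.Tensor:
    """Subtract the mean embedding, removing the common direction that
    dominates a narrow cone; this typically lowers the estimate above."""
    return embeddings - embeddings.mean(dim=0, keepdim=True)
```

Mean-centering is only a first-order fix; the methods cited above instead learn the correction during training or apply it as a dedicated normalization module.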