Proceedings of the Third Workshop on Insights From Negative Results in NLP 2022
DOI: 10.18653/v1/2022.insights-1.1

On Isotropy Calibration of Transformer Models

Abstract: Different studies of the embedding space of transformer models suggest that the distribution of contextual representations is highly anisotropic: the embeddings are distributed in a narrow cone. Meanwhile, static word representations (e.g., Word2Vec or GloVe) have been shown to benefit from isotropic spaces. Therefore, previous work has developed methods to calibrate the embedding space of transformers in order to ensure isotropy. However, a recent study (Cai et al., 2021) shows that the embedding space of transformers…
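The anisotropy described in the abstract is usually quantified with simple geometric proxies. Below is a minimal sketch (not code from the paper) of one common proxy: the average cosine similarity between randomly sampled pairs of contextual embeddings, which stays close to 0 in an isotropic space and approaches 1 when the vectors occupy a narrow cone. The array name and file path in the usage comment are hypothetical.

```python
# Minimal sketch of a common anisotropy proxy: mean cosine similarity between
# randomly sampled pairs of embeddings. Close to 0 => roughly isotropic;
# close to 1 => the "narrow cone" geometry described in the abstract.
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray, n_pairs: int = 10_000,
                         seed: int = 0) -> float:
    """Estimate anisotropy from random embedding pairs (sampled with replacement)."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    x, y = embeddings[i], embeddings[j]
    cos = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    return float(cos.mean())

# Hypothetical usage: `reps` would be a (num_tokens, hidden_dim) array of
# contextual vectors collected from one layer of a Transformer model.
# reps = np.load("contextual_embeddings.npy")
# print(mean_pairwise_cosine(reps))
```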

Cited by 4 publications (5 citation statements) · References 16 publications
“…Cai et al. (2020) showed that, in spite of BERT embeddings having global anisotropy, each cluster in the embedding space is isotropic, and that this local isotropy could be enough for Transformer models to achieve their full representation power. This hypothesis is supported by recent empirical results from Ding et al. (2022). If the anisotropy comes from the existence of different clusters, and these clusters encode non-semantic information like token frequency, this can be matched with the biases described by Jiang et al. (2022) and the representation degeneration described by Gao et al. (2019).…”
Section: Related Work (supporting)
confidence: 65%
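To make the local-isotropy argument quoted above concrete, here is a small sketch (my own illustration, not Cai et al.'s code) that clusters the embeddings and compares the global anisotropy proxy with the proxy computed inside each mean-centered cluster; under the quoted hypothesis, the within-cluster values should be much closer to 0. It reuses the hypothetical mean_pairwise_cosine helper from the sketch after the abstract.

```python
# Sketch: compare global anisotropy with within-cluster anisotropy.
# Assumes mean_pairwise_cosine() from the earlier sketch is in scope.
import numpy as np
from sklearn.cluster import KMeans

def global_vs_local_anisotropy(embeddings: np.ndarray, n_clusters: int = 10,
                               seed: int = 0) -> tuple[float, float]:
    global_score = mean_pairwise_cosine(embeddings)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    local_scores = []
    for k in range(n_clusters):
        cluster = embeddings[labels == k]
        if len(cluster) < 2:
            continue
        # Shift the cluster to the origin before measuring its isotropy,
        # so the cluster's offset from the global mean does not dominate.
        local_scores.append(mean_pairwise_cosine(cluster - cluster.mean(axis=0)))
    return global_score, float(np.mean(local_scores))
```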
“…That is, anisotropy is not a problem if it is the same for all tokens. If this is true, then isotropy correction techniques should not improve the performance of these models on semantic tasks, which has been shown empirically by Ding et al. (2022) and Jiang et al. (2022). In the next set of experiments, we further support this idea through empirical evidence.…”
Section: Conclusion on Bias Analysis (supporting)
confidence: 53%
“…To this end, they propose regularization terms that hamper the singular value decay of the embedding matrix. However, despite the success of these optimization tricks in lowering the anisotropy of Transformer representations, Ding et al. (2022) have recently shown that they do not bring any improvement on several tasks such as summarization and sentence similarity (STS). They even observed some deterioration in performance caused by the anisotropy mitigation techniques.…”
Section: Introduction (mentioning)
confidence: 99%
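For context, isotropy calibration is applied both through training-time regularizers like those discussed in the quote above and as post-hoc transformations of the embedding space. The sketch below is my own illustration of one simple post-hoc variant (centering and removing the top principal directions, in the spirit of "all-but-the-top" post-processing), not the specific regularization terms the cited works propose.

```python
# Sketch of a simple post-hoc isotropy calibration: center the embeddings and
# project out their dominant principal directions, which flattens the spectrum.
import numpy as np

def remove_top_directions(embeddings: np.ndarray, n_components: int = 3) -> np.ndarray:
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions of the centered embedding matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                     # (n_components, hidden_dim)
    return centered - centered @ top.T @ top    # subtract the projections
```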
“…Following Cai et al. (2021), this global estimate of anisotropy does not rule out the possibility of distinct and locally isotropic clusters in the embedding space. Ding et al. (2022) show that isotropy calibration methods (Gao et al., 2019; Li et al., 2020) do not lead to consistent improvements on downstream tasks when models already benefit from local isotropy.…”
(mentioning)
confidence: 98%