2022
DOI: 10.48550/arXiv.2205.06226
Preprint

The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

Abstract: Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network architecture, which breaks the symmetry between the positive pairs. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when trivial collapsed global optimal solutions exist, neural networks trained by (stochastic) gradient descent can still learn competitive re…
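
To make the mechanism described in the abstract concrete, the following is a minimal sketch of a BYOL-style non-contrastive objective with a prediction head and stop-gradient, written in PyTorch. The module names (`online_encoder`, `predictor`, `target_encoder`) are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def byol_style_loss(online_encoder, predictor, target_encoder, view1, view2):
    """Symmetrized negative cosine similarity between the predicted online
    features and stop-gradient target features; no negative pairs are used."""
    # Online branch: encoder followed by the prediction head that breaks the
    # symmetry between the two branches.
    p1 = predictor(online_encoder(view1))
    p2 = predictor(online_encoder(view2))
    # Target branch: no prediction head, and gradients are stopped.
    with torch.no_grad():
        z1 = target_encoder(view1)
        z2 = target_encoder(view2)
    loss = -(F.cosine_similarity(p1, z2, dim=-1).mean() +
             F.cosine_similarity(p2, z1, dim=-1).mean()) / 2
    return loss
```

As the abstract notes, collapsed constant outputs still globally minimize a loss of this form; the paper studies why (stochastic) gradient descent with the prediction head nevertheless avoids them.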

Cited by 4 publications (6 citation statements) | References 36 publications

Citation statements (ordered by relevance):
“…In this regard LID is characterized by contrastive learning, and this feature consistency approach is important in that it also plays a semi-supervised role in ConstraInver, hence it is called contrastive semi-supervised learning. We experimentally demonstrate that the performance of projected feature consistency is significantly higher than that of unprojected and label consistency, which we hypothesize is due to the filtering effect of the projection head on meaningful information [66], [67]. We include at least one sample containing real wells in each batch to ensure that the learning process does not deviate from the right track.…”
Section: Log Information Diffusion (mentioning)
confidence: 98%
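As a rough illustration of the projected-versus-unprojected comparison in the quoted passage, a feature-consistency term might be sketched as below; `encoder` and `projection_head` are hypothetical placeholders, not ConstraInver's actual implementation.

```python
import torch.nn.functional as F

def consistency_loss(encoder, projection_head, x_a, x_b, projected=True):
    """Consistency between two feature views. With projected=True the loss is
    applied after the projection head, which the quoted passage hypothesizes
    filters the features down to meaningful information."""
    f_a, f_b = encoder(x_a), encoder(x_b)
    if projected:
        f_a, f_b = projection_head(f_a), projection_head(f_b)
    return F.mse_loss(F.normalize(f_a, dim=-1), F.normalize(f_b, dim=-1))
```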
“…First, the new loss function keeps the local similarity between the negative pair of data (X i , X i − ), while the existing self-supervised learning methods either ignore the negative pair (in non-contrastive methods) or push the negative pair far from each other despite their Euclidean distance (in contrastive methods). Second, the regularization part in (5) can help avoid dimensional collapse observed in some self-supervised learning methods (Hua et al, 2021;Wen and Li, 2022).…”
Section: A Computationally Efficient Formulation (mentioning)
confidence: 99%
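The quoted equation (5) is not reproduced in the excerpt. Purely as a hedged illustration of a regularizer that discourages dimensional collapse, one common form penalizes the off-diagonal entries of the batch feature covariance; this sketch is not the cited paper's actual regularizer.

```python
import torch

def off_diagonal_covariance_penalty(z):
    """Generic anti-collapse term: penalize off-diagonal entries of the batch
    feature covariance so embedding dimensions stay decorrelated. This is an
    illustration only, not the (unshown) regularizer from Eq. (5)."""
    z = z - z.mean(dim=0, keepdim=True)        # center features over the batch
    cov = (z.T @ z) / (z.shape[0] - 1)         # (d, d) covariance estimate
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / z.shape[1]
```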
“…Since the introduction of self-supervised learning, recent works have tried to theoretically understand why the learned data representation from augmented data can help improve the downstream analysis (Arora et al, 2019;Tsai et al, 2020;Wei et al, 2020;Tian et al, 2020b;Tosh et al, 2021;Wen and Li, 2021;HaoChen et al, 2021;Wang, 2022;Wen and Li, 2022;Balestriero and LeCun, 2022). Most of these theoretical works focus on the setting where the special conditional independence structure for the augmented data is assumed.…”
Section: Introduction, Data Augmentation in Representation Learning (mentioning)
confidence: 99%
“…From an empirical side, Chen & He (2021) think that the predictor helps approximate the expectation over augmentations, and Zhang et al (2022a) take a center-residual decomposition of representations for analyzing the collapse. From a theoretical perspective, Tian et al (2021) analyze the dynamics of predictor weights under simple linear networks, and Wen & Li (2022) obtain optimization guarantees for two-layer nonlinear networks. These theoretical discussions often need strong assumptions on the data distribution (e.g., standard normal (Tian et al, 2021)) and augmentations (e.g., random masking (Wen & Li, 2022)).…”
Section: Introduction (mentioning)
confidence: 99%
“…From a theoretical perspective, Tian et al (2021) analyze the dynamics of predictor weights under simple linear networks, and Wen & Li (2022) obtain optimization guarantees for two-layer nonlinear networks. These theoretical discussions often need strong assumptions on the data distribution (e.g., standard normal (Tian et al, 2021)) and augmentations (e.g., random masking (Wen & Li, 2022)). Besides, their analyses are often problem-specific, which is hardly extendable to other non-contrastive variants without a predictor, e.g., DINO.…”
Section: Introduction (mentioning)
confidence: 99%