2021
DOI: 10.48550/arxiv.2110.11316
Preprint

CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Abstract: Contrastive learning with the InfoNCE objective is exceptionally successful in various self-supervised learning tasks. Recently, the CLIP model yielded impressive results on zero-shot transfer learning when using InfoNCE for learning visual representations from natural language supervision. However, InfoNCE as a lower bound on the mutual information has been shown to perform poorly for high mutual information. In contrast, the InfoLOOB upper bound (leave one out bound) works well for high mutual information bu…
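Since the abstract is cut off above, the two objectives it contrasts are sketched below as context, in the notation commonly used for CLIP-style training; the symbols (matched image/text embeddings $x_i, y_i$, batch size $N$, temperature $\tau$) are our choice here and are not quoted from the paper.

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N} \ln \frac{\exp(x_i^{\top} y_i / \tau)}{\sum_{j=1}^{N} \exp(x_i^{\top} y_j / \tau)}, \qquad \mathcal{L}_{\text{InfoLOOB}} = -\frac{1}{N}\sum_{i=1}^{N} \ln \frac{\exp(x_i^{\top} y_i / \tau)}{\sum_{j \neq i} \exp(x_i^{\top} y_j / \tau)}.$$

The only structural difference is that InfoLOOB leaves the positive pair out of the denominator (hence "leave one out"), which keeps the objective from saturating when the matched pair dominates the batch, i.e., at high mutual information.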

Cited by 7 publications (12 citation statements) | References 21 publications
“…Recent vision-language models [13,24,33,40] bridge the two modalities by learning two encoders jointly. Also, the models are now built with much larger neural networks.…”
Section: Related Work (mentioning)
confidence: 99%
“…After consuming 400 million data pairs, the CLIP model demonstrates a remarkable zero-shot image recognition capability. Similar to CoOp [62], our approach is orthogonal to research on CLIP-like models [13,24,33,40], aiming to offer an efficient solution for adapting pre-trained vision-language models to downstream applications.…”
Section: Related Work (mentioning)
confidence: 99%
“…b. Vision-Language Pretraining: Joint vision-language pretraining (VLP) is an active research area [1,19,39,63,68], where the availability of large-scale image-text datasets, e.g., YFCC100M [71] and Conceptual Captions [9,67], has played a key role in its progress. Although multiple concurrent works are being proposed to further improve VLP models [75], our work is different from them in a few important ways.…”
Section: Related Work (mentioning)
confidence: 99%
“…The convergence of self-supervised pretraining techniques in natural language processing and computer vision has brought about a renaissance of cross-modal representation learning methods [1,19,30,39,52,63,68,75], where large-scale, weakly correlated multimodal data (e.g., image-text pairs) is used to learn cross-modal representations using contrastive learning techniques. In particular, the recently proposed CLIP [63] model has garnered significant attention due to its impressive zero-shot recognition ability and excellent transfer performance on downstream tasks.…”
Section: Introduction (mentioning)
confidence: 99%
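The excerpts above all refer to CLIP-style contrastive pretraining on image-text pairs, which is also the setting CLOOB modifies. As a minimal sketch, not code from either paper, the symmetric batch loss can be written as below; the function name, the NumPy implementation, and the leave_one_out switch for an InfoLOOB-style denominator are our own illustration under the assumption of L2-normalized embeddings.

```python
import numpy as np

def contrastive_loss(img, txt, tau=0.07, leave_one_out=False):
    """Symmetric contrastive loss over a batch of L2-normalized image/text
    embeddings of shape [N, d]; matched pairs share the same row index.

    leave_one_out=False -> InfoNCE-style denominator (positive pair included)
    leave_one_out=True  -> InfoLOOB-style denominator (negatives only)
    """
    logits = img @ txt.T / tau                       # [N, N] scaled similarities

    def one_direction(l):
        # Numerically stabilized softmax terms; the row-wise shift cancels in the ratio.
        e = np.exp(l - l.max(axis=1, keepdims=True))
        num = np.diag(e)                             # positive (matched) pairs
        denom = e.sum(axis=1)
        if leave_one_out:
            denom = denom - num                      # drop the positive pair from the denominator
        return -np.mean(np.log(num / denom))

    # Average the image->text and text->image directions, as in CLIP-style training.
    return 0.5 * (one_direction(logits) + one_direction(logits.T))

# Toy usage with random unit-norm embeddings (illustration only).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 64))
txt = rng.normal(size=(8, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(contrastive_loss(img, txt))                       # InfoNCE-style
print(contrastive_loss(img, txt, leave_one_out=True))   # InfoLOOB-style
```

Note that this sketch covers only the objective; CLOOB additionally retrieves embeddings with modern Hopfield networks before applying InfoLOOB, which is not shown here.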