2021
DOI: 10.48550/arxiv.2110.05208
Preprint

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Abstract: Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread…

Cited by 34 publications (64 citation statements)
References 41 publications
“…
Model        Available code   Training data   Image encoder
CLIP [18]    No               YFCC15M-V1 †    ViT, ResNet
SLIP [15]    Yes              YFCC15M-V1 †    ViT
DeCLIP [9]   Yes              YFCC15M-V2      ViT, ResNet
FILIP [30]   No               -                ViT

…els from language supervision, or more specifically, image-text pairs. Basically, CLIP adopts the contrastive loss to push the embeddings of matched image-text pairs together while pushing those of non-matched pairs apart.…”
Section: Methods
mentioning confidence: 99%
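The statement above describes the training signal shared by these models: matched image-text pairs are pulled together and non-matched pairs pushed apart. As a minimal sketch of that mechanism, the following CLIP-style symmetric contrastive (InfoNCE) loss operates on a batch of paired embeddings; the function name, tensor shapes, and temperature value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss over a batch of
# matched image-text embeddings. Names and the temperature are assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim); row i of each is a matched pair."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature
    # Matched pairs sit on the diagonal; every other pair is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```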
“…Recent vision-language models [13,24,33,40] bridge the two modalities by learning two encoders jointly. Also, the models are now built with much larger neural networks.…”
Section: Related Work
mentioning confidence: 99%
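As a rough sketch of the dual-encoder design mentioned in the statement above, the module below encodes each modality with its own backbone and projects both into a shared embedding space where the contrastive loss can be applied. The class name, backbone interfaces, and dimensions are assumptions for illustration, not any specific paper's architecture.

```python
# Hedged sketch of a two-encoder (dual-encoder) vision-language model.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, image_backbone: nn.Module, text_backbone: nn.Module,
                 image_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_backbone = image_backbone  # e.g. a ViT or ResNet trunk (assumed interface)
        self.text_backbone = text_backbone    # e.g. a Transformer text encoder (assumed interface)
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor):
        # Each modality is encoded separately, then projected into the shared space.
        image_emb = self.image_proj(self.image_backbone(images))
        text_emb = self.text_proj(self.text_backbone(tokens))
        return image_emb, text_emb
```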
“…After consuming 400 million data pairs, the CLIP model demonstrates a remarkable zero-shot image recognition capability. Similar to CoOp [62], our approach is orthogonal to the research of CLIP-like models [13,24,33,40], aiming to offer an efficient solution for adapting pre-trained vision-language models to downstream applications.…”
Section: Related Work
mentioning confidence: 99%
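The zero-shot recognition mentioned above works by comparing an image embedding against text embeddings of class prompts (e.g. "a photo of a {class}") and picking the most similar class. A minimal sketch, assuming the image and prompt embeddings have already been produced by a CLIP-like dual encoder:

```python
# Hedged sketch of zero-shot classification from precomputed embeddings.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor,        # (1, dim) encoded image
                       class_text_embs: torch.Tensor   # (num_classes, dim) encoded prompts
                       ) -> int:
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    # Cosine similarity against every class prompt; highest similarity wins.
    sims = image_emb @ class_text_embs.t()
    return int(sims.argmax(dim=-1))
```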
“…The success of CLIP and ALIGN has enlightened many downstream vision-language tasks. For instance, DeCLIP [35] proposes to utilize self-, multi-view, and nearest-neighbor supervisions among the image-text pairs for data efficient pretraining of CLIP. On visual classification tasks, CLIP-Adapter [15] argues that fine-tuning contrastive vision-language models with linear adapters is a better alternative to prompt tuning.…”
Section: Related Work
mentioning confidence: 99%
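For context on the adapter idea attributed to CLIP-Adapter in the statement above, here is a hedged sketch of a small bottleneck adapter trained on top of a frozen pre-trained feature and blended back with a residual ratio; the layer sizes, activation choice, and blending ratio are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of a feature adapter over frozen vision-language features.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    def __init__(self, dim: int, reduction: int = 4, residual_ratio: float = 0.2):
        super().__init__()
        self.residual_ratio = residual_ratio  # assumed blending ratio
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, frozen_feature: torch.Tensor) -> torch.Tensor:
        # Blend the adapted feature with the original frozen feature.
        adapted = self.adapter(frozen_feature)
        return self.residual_ratio * adapted + (1 - self.residual_ratio) * frozen_feature
```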