2021
DOI: 10.48550/arxiv.2112.12750
Preprint

SLIP: Self-supervision meets Language-Image Pre-training

Abstract: Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. […]
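The abstract's key idea is a multi-task objective: the CLIP image-text contrastive loss plus SimCLR-style self-supervision driving a shared image encoder. The PyTorch snippet below is a hedged sketch of that combination, not the paper's implementation: the function names, the simplified two-view InfoNCE standing in for the full SimCLR NT-Xent loss, the reuse of one augmented view for both branches, the omitted projection heads on the CLIP branch, and the default `ssl_scale` weight are all assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    # Symmetric InfoNCE: row i of `a` is the positive for row i of `b`.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def slip_loss(image_encoder, text_encoder, ssl_head,
              view1, view2, texts, ssl_scale: float = 1.0):
    # CLIP branch: one augmented view against its paired caption.
    feats1 = image_encoder(view1)
    img = F.normalize(feats1, dim=-1)
    txt = F.normalize(text_encoder(texts), dim=-1)
    clip_term = contrastive_loss(img, txt)

    # Self-supervised branch: two augmented views of the same images,
    # contrasted through a separate projection head (SimCLR-style).
    z1 = F.normalize(ssl_head(feats1), dim=-1)
    z2 = F.normalize(ssl_head(image_encoder(view2)), dim=-1)
    ssl_term = contrastive_loss(z1, z2)

    # Multi-task objective: CLIP loss plus a scaled self-supervised loss.
    return clip_term + ssl_scale * ssl_term
```

Both loss terms backpropagate into the same image encoder, which is the point of the framework: the language supervision and the image-only self-supervision are complementary training signals.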

Cited by 36 publications (76 citation statements)
References 32 publications
“…
Method       Training code   Available data   Image encoder
CLIP [18]    No              YFCC15M-V1 †     ViT, ResNet
SLIP [15]    Yes             YFCC15M-V1 †     ViT
DeCLIP [9]   Yes             YFCC15M-V2       ViT, ResNet
FILIP [30]   No              -                ViT

…models from language supervision, or more specifically, image-text pairs. Basically, CLIP adopts the contrastive loss to push the embeddings of matched image-text pairs together while pushing those of non-matched pairs apart.…”
Section: Methods
Mentioning confidence: 99%
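As a concrete illustration of the objective this statement describes, here is a minimal PyTorch sketch of CLIP's symmetric contrastive (InfoNCE) loss, in the spirit of the pseudocode in the CLIP paper. The function name, the learnable `logit_scale` temperature, and the equal weighting of the two cross-entropy terms are conventions assumed here rather than code from any of the cited works.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """The i-th image and i-th caption form the only matched pair in the
    batch; every other combination serves as a negative."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities, scaled by a (typically learnable)
    # temperature parameter.
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matched pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(image_features.size(0),
                           device=image_features.device)
    return 0.5 * (F.cross_entropy(logits_per_image, targets)
                  + F.cross_entropy(logits_per_text, targets))
```

Because the targets are the diagonal indices, each image is classified against all captions in the batch and vice versa, which is exactly the "push matched pairs together, non-matched apart" behavior described above.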
“…Witnessing its great success, researchers continue to push the frontier of CLIP. For instance, SLIP [15], DeCLIP [9] and FILIP [30] achieve considerable improvements via embracing different kinds of supervision within the image-text pairs. However, it remains challenging to make a fair comparison between these methods.…”
Section: Methods
Mentioning confidence: 99%
“…This paradigm is also referred to as transfer learning. Recently, image-text pre-training has become increasingly popular in computer vision as a pre-training task [72,47,65,69]. Recent work has explored alternative strategies for adapting these models to specific target tasks [106,35,105], for instance via a lightweight residual feature adapter.…”
Section: Related Work
Mentioning confidence: 99%
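The "lightweight residual feature adapter" mentioned in this statement refers to adapter-style tuning, in which a small trainable module refines the features of a frozen pre-trained model. A hedged sketch of one such design, assuming a bottleneck MLP blended residually with the frozen feature; the class name, layer sizes, and `ratio` default are illustrative choices, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Hypothetical lightweight adapter: a small bottleneck MLP whose
    output is blended residually with the frozen backbone feature."""
    def __init__(self, dim: int, bottleneck: int = 64, ratio: float = 0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim), nn.ReLU(inplace=True),
        )
        self.ratio = ratio  # how far the adapter moves the frozen feature

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Only the adapter is trained; the pre-trained encoder stays frozen.
        return self.ratio * self.mlp(feat) + (1.0 - self.ratio) * feat
```

Only the adapter's few parameters are updated during downstream adaptation, which is what keeps this transfer strategy lightweight relative to fine-tuning the whole image-text model.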