2021
DOI: 10.48550/arxiv.2112.12750
Preprint

SLIP: Self-supervision meets Language-Image Pre-training

Abstract: Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. […]
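The abstract's key idea is a multi-task objective: the CLIP image-text contrastive loss plus SimCLR-style self-supervision driving a shared image encoder. The PyTorch snippet below is a hedged sketch of that combination, not the paper's implementation: the function names, the simplified two-view InfoNCE standing in for the full SimCLR NT-Xent loss, the reuse of one augmented view for both branches, the omitted projection heads on the CLIP branch, and the default `ssl_scale` weight are all assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    # Symmetric InfoNCE: row i of `a` is the positive for row i of `b`.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def slip_loss(image_encoder, text_encoder, ssl_head,
              view1, view2, texts, ssl_scale: float = 1.0):
    # CLIP branch: one augmented view against its paired caption.
    feats1 = image_encoder(view1)
    img = F.normalize(feats1, dim=-1)
    txt = F.normalize(text_encoder(texts), dim=-1)
    clip_term = contrastive_loss(img, txt)

    # Self-supervised branch: two augmented views of the same images,
    # contrasted through a separate projection head (SimCLR-style).
    z1 = F.normalize(ssl_head(feats1), dim=-1)
    z2 = F.normalize(ssl_head(image_encoder(view2)), dim=-1)
    ssl_term = contrastive_loss(z1, z2)

    # Multi-task objective: CLIP loss plus a scaled self-supervised loss.
    return clip_term + ssl_scale * ssl_term
```

Both loss terms backpropagate into the same image encoder, which is the point of the framework: the language supervision and the image-only self-supervision are complementary training signals.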

Cited by 36 publications (76 citation statements)
References 32 publications
“…
Method       Training code   Available data   Image encoder
CLIP [18]    No              YFCC15M-V1 †     ViT, ResNet
SLIP [15]    Yes             YFCC15M-V1 †     ViT
DeCLIP [9]   Yes             YFCC15M-V2       ViT, ResNet
FILIP [30]   No              -                ViT

…models from language supervision, or more specifically, image-text pairs. Basically, CLIP adopts the contrastive loss to push the embeddings of matched image-text pairs together while pushing those of non-matched pairs apart.…”
Section: Methods
Mentioning confidence: 99%
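As a concrete illustration of the objective this statement describes, here is a minimal PyTorch sketch of CLIP's symmetric contrastive (InfoNCE) loss, in the spirit of the pseudocode in the CLIP paper. The function name, the learnable `logit_scale` temperature, and the equal weighting of the two cross-entropy terms are conventions assumed here rather than code from any of the cited works.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """The i-th image and i-th caption form the only matched pair in the
    batch; every other combination serves as a negative."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise cosine similarities, scaled by a (typically learnable)
    # temperature parameter.
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matched pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(image_features.size(0),
                           device=image_features.device)
    return 0.5 * (F.cross_entropy(logits_per_image, targets)
                  + F.cross_entropy(logits_per_text, targets))
```

Because the targets are the diagonal indices, each image is classified against all captions in the batch and vice versa, which is exactly the "push matched pairs together, non-matched apart" behavior described above.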
“…Witnessing its great success, researchers continue to push the frontier of CLIP. For instance, SLIP [15], DeCLIP [9] and FILIP [30] achieve considerable improvements via embracing different kinds of supervision within the image-text pairs. However, it remains challenging to make a fair comparison between these methods.…”
Section: Methods
Mentioning confidence: 99%
“…This paradigm is also referred to as transfer learning. Recently, image-text pre-training has become increasingly popular in computer vision as a pre-training task [72,47,65,69]. Recent work has explored alternative strategies for adapting these models to specific target tasks [106,35,105], for instance via a lightweight residual feature adapter.…”
Section: Related Work
Mentioning confidence: 99%
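The "lightweight residual feature adapter" mentioned in this statement refers to adapter-style tuning, in which a small trainable module refines the features of a frozen pre-trained model. A hedged sketch of one such design, assuming a bottleneck MLP blended residually with the frozen feature; the class name, layer sizes, and `ratio` default are illustrative choices, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Hypothetical lightweight adapter: a small bottleneck MLP whose
    output is blended residually with the frozen backbone feature."""
    def __init__(self, dim: int, bottleneck: int = 64, ratio: float = 0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim), nn.ReLU(inplace=True),
        )
        self.ratio = ratio  # how far the adapter moves the frozen feature

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Only the adapter is trained; the pre-trained encoder stays frozen.
        return self.ratio * self.mlp(feat) + (1.0 - self.ratio) * feat
```

Only the adapter's few parameters are updated during downstream adaptation, which is what keeps this transfer strategy lightweight relative to fine-tuning the whole image-text model.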