2021
DOI: 10.48550/arxiv.2110.05208
Preprint

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Abstract: Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread…

Cited by 34 publications (64 citation statements)
References 41 publications
“…
Model        Available code   Training data   Image encoder
CLIP [18]    No               YFCC15M-V1 †    ViT, ResNet
SLIP [15]    Yes              YFCC15M-V1 †    ViT
DeCLIP [9]   Yes              YFCC15M-V2      ViT, ResNet
FILIP [30]   No               -                ViT

…els from language supervision, or more specifically, image-text pairs. Basically, CLIP adopts the contrastive loss to push the embeddings of matched image-text pairs together while pushing those of non-matched pairs apart.…”
Section: Methods
mentioning confidence: 99%
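The statement above describes the training signal shared by these models: matched image-text pairs are pulled together and non-matched pairs pushed apart. As a minimal sketch of that mechanism, the following CLIP-style symmetric contrastive (InfoNCE) loss operates on a batch of paired embeddings; the function name, tensor shapes, and temperature value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss over a batch of
# matched image-text embeddings. Names and the temperature are assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim); row i of each is a matched pair."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature
    # Matched pairs sit on the diagonal; every other pair is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```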
“…Recent vision-language models [13,24,33,40] bridge the two modalities by learning two encoders jointly. Also, the models are now built with much larger neural networks.…”
Section: Related Work
mentioning confidence: 99%
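As a rough sketch of the dual-encoder design mentioned in the statement above, the module below encodes each modality with its own backbone and projects both into a shared embedding space where the contrastive loss can be applied. The class name, backbone interfaces, and dimensions are assumptions for illustration, not any specific paper's architecture.

```python
# Hedged sketch of a two-encoder (dual-encoder) vision-language model.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, image_backbone: nn.Module, text_backbone: nn.Module,
                 image_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_backbone = image_backbone  # e.g. a ViT or ResNet trunk (assumed interface)
        self.text_backbone = text_backbone    # e.g. a Transformer text encoder (assumed interface)
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor):
        # Each modality is encoded separately, then projected into the shared space.
        image_emb = self.image_proj(self.image_backbone(images))
        text_emb = self.text_proj(self.text_backbone(tokens))
        return image_emb, text_emb
```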
“…After consuming 400 million data pairs, the CLIP model demonstrates a remarkable zero-shot image recognition capability. Similar to CoOp [62], our approach is orthogonal to the research of CLIP-like models [13,24,33,40], aiming to offer an efficient solution for adapting pre-trained vision-language models to downstream applications.…”
Section: Related Work
mentioning confidence: 99%
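The zero-shot recognition mentioned above works by comparing an image embedding against text embeddings of class prompts (e.g. "a photo of a {class}") and picking the most similar class. A minimal sketch, assuming the image and prompt embeddings have already been produced by a CLIP-like dual encoder:

```python
# Hedged sketch of zero-shot classification from precomputed embeddings.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor,        # (1, dim) encoded image
                       class_text_embs: torch.Tensor   # (num_classes, dim) encoded prompts
                       ) -> int:
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    # Cosine similarity against every class prompt; highest similarity wins.
    sims = image_emb @ class_text_embs.t()
    return int(sims.argmax(dim=-1))
```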
“…The success of CLIP and ALIGN has enlightened many downstream vision-language tasks. For instance, DeCLIP [35] proposes to utilize self-, multi-view, and nearest-neighbor supervisions among the image-text pairs for data efficient pretraining of CLIP. On visual classification tasks, CLIP-Adapter [15] argues that fine-tuning contrastive vision-language models with linear adapters is a better alternative to prompt tuning.…”
Section: Related Work
mentioning confidence: 99%
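For context on the adapter idea attributed to CLIP-Adapter in the statement above, here is a hedged sketch of a small bottleneck adapter trained on top of a frozen pre-trained feature and blended back with a residual ratio; the layer sizes, activation choice, and blending ratio are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of a feature adapter over frozen vision-language features.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    def __init__(self, dim: int, reduction: int = 4, residual_ratio: float = 0.2):
        super().__init__()
        self.residual_ratio = residual_ratio  # assumed blending ratio
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, frozen_feature: torch.Tensor) -> torch.Tensor:
        # Blend the adapted feature with the original frozen feature.
        adapted = self.adapter(frozen_feature)
        return self.residual_ratio * adapted + (1 - self.residual_ratio) * frozen_feature
```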