2022
DOI: 10.48550/arxiv.2203.05796
Preprint

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Abstract: Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is because researchers do not choose consistent training recipes and even use different data, hampering the fair comparison between different methods. In this work, we propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. We conduct a …

Cited by 4 publications (5 citation statements) | References 25 publications
“…• Inter-Modality Contrastive Learning: Contrastive learning is widely used for inter-modality relation modelling, such as CLIP [77] and its follow-up works [19, 104, 134–138]. The representative work SCALE [104] is trained with Self-harmonized Inter-Modality Contrastive Learning (SIMCL), which can be written as:…”
Section: Modality Interactive Learning (mentioning)
confidence: 99%
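The SIMCL equation itself is truncated in the excerpt above. As a minimal sketch (not SCALE's exact formulation), the symmetric inter-modality contrastive objective that CLIP-style methods optimize can be written as below; the notation is introduced here for illustration only: z_i^v and z_i^t are the normalized visual and text embeddings of the i-th pair, tau is a learnable temperature, and N is the batch size.

% Sketch of a symmetric inter-modality (CLIP-style) contrastive loss.
% Assumed notation: z_i^v, z_i^t are L2-normalized image/text embeddings,
% tau is a learnable temperature, N is the batch size.
\[
\mathcal{L}_{\mathrm{inter}}
 = -\frac{1}{2N}\sum_{i=1}^{N}\left[
 \log\frac{\exp\left(z_i^{v}\cdot z_i^{t}/\tau\right)}{\sum_{j=1}^{N}\exp\left(z_i^{v}\cdot z_j^{t}/\tau\right)}
 +\log\frac{\exp\left(z_i^{t}\cdot z_i^{v}/\tau\right)}{\sum_{j=1}^{N}\exp\left(z_i^{t}\cdot z_j^{v}/\tau\right)}
 \right]
\]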
“…Pretraining Datasets: To make a fair comparison with state-of-the-art contrastive vision-language pretraining approaches, we adopt the YFCC15M benchmark proposed in (Cui et al., 2022), which builds on a subset of YFCC100M (Thomee et al., 2016) consisting of 15M image-text pairs. In addition, we construct a 30M version of the pretraining data by adding Conceptual Captions 3M (CC3M) (Sharma et al., 2018) and 12M (CC12M) (Changpinyo et al., 2021).…”
Section: Experimental Settings (mentioning)
confidence: 99%
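For concreteness, the two pretraining configurations described in the quote (YFCC15M alone, about 15M pairs, versus YFCC15M + CC3M + CC12M, about 30M pairs) could be assembled as in the hedged Python sketch below. The file paths, the tab-separated layout, and the ImageTextPairs loader are hypothetical placeholders, not the cited authors' code.

# Hedged sketch: assembling the 15M and 30M pretraining settings described above.
# File names and the TSV layout are assumptions for illustration.
from torch.utils.data import ConcatDataset, Dataset


class ImageTextPairs(Dataset):
    """Minimal image-text pair dataset over (image_path, caption) rows."""

    def __init__(self, tsv_path):
        with open(tsv_path) as f:
            # Each line: "<image_path>\t<caption>" (assumed layout).
            self.pairs = [line.rstrip("\n").split("\t", 1) for line in f]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        # A real loader would decode and transform the image here.
        return image_path, caption


yfcc15m = ImageTextPairs("yfcc15m.tsv")   # YFCC100M subset (Cui et al., 2022)
cc3m = ImageTextPairs("cc3m.tsv")         # Conceptual Captions 3M
cc12m = ImageTextPairs("cc12m.tsv")       # Conceptual Captions 12M

pretrain_15m = yfcc15m                                 # YFCC15M benchmark setting
pretrain_30m = ConcatDataset([yfcc15m, cc3m, cc12m])   # extended ~30M setting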
“…We first compare HiCLIP with state-of-the-art CLIP-family approaches on the YFCC15M benchmark (Cui et al., 2022). As DeCLIP (2022) applies multiple single-modal self-supervised tasks in addition to CLIP, we incorporated the same objectives into our hierarchy-aware model for a fair comparison (denoted as HiDeCLIP). By combining the contrastive learning and self-supervised learning loss functions, our HiDeCLIP further improves zero-shot ImageNet classification performance by 2.7% over DeCLIP, and is overall 13.1% higher than CLIP.…”
Section: Visual Recognition (mentioning)
confidence: 99%
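The loss combination described in this excerpt (contrastive plus auxiliary self-supervised objectives, in the style of DeCLIP/HiDeCLIP training) can be summarized by the hedged sketch below; the weighting factor lambda and the grouping of all self-supervised terms into a single symbol are illustrative assumptions, not the papers' exact formulation.

% Illustrative combination of the inter-modality contrastive loss with
% auxiliary self-supervised terms; lambda is a hypothetical weighting factor,
% and L_self-sup may itself aggregate several objectives.
\[
\mathcal{L}_{\mathrm{total}}
 = \mathcal{L}_{\mathrm{contrastive}}
 + \lambda\,\mathcal{L}_{\mathrm{self\text{-}sup}}
\]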
“…To gain efficiency, cross-modality parameter sharing [52] has been proposed. Moreover, some approaches aim to fully exploit the noisy training data by modifying the training objective with self-supervision [30], within-modality loss terms [9,23],…”
Section: Related Work (mentioning)
confidence: 99%