EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

Wang, Jue; Wang, Haofan; Deng, Jincan; Wu, Weijia; Zhang, Debing

doi:10.48550/arxiv.2109.04699

Cited by 3 publications

(6 citation statements)

References 31 publications


“…[7] proposes a distillation based loss to better handle the noise in the dataset. Efficient-CLIP [44] is a concurrent work that uses text-only data to perform unimodal MLM task. DeCLIP [1] is another concurrent work that uses multiple unimodal SSL losses in addition to removing the noise from negatives with the help of nearest neighbors [11,22].…”

Section: Related Work (mentioning)

confidence: 99%

A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision

Tejankar, Sanjabi, Wu et al. 2021

Preprint


Using natural language as a supervision for training visual recognition models holds great promise. Recent works have shown that if such supervision is used in the form of alignment between images and captions in large training datasets, then the resulting aligned models perform well on zero-shot classification as downstream tasks. In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models. Through extensive and careful experiments, we show that: 1) A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset. Surprisingly, we observe that this approach improves the zero-shot classification performance when combined with word balancing. 2) Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption. Models trained on images with real and pseudo-BoW captions achieve stronger zero-shot performance. On ImageNet-1k zero-shot evaluation, our best model, that uses only 3M image-caption pairs, performs on-par with a CLIP model trained on 15M image-caption pairs (31.5% vs 31.3%).
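To make the Bag-of-Words idea in this abstract concrete, here is a minimal sketch of reducing a caption to an unordered set of content words. It is an illustration only and not the authors' pipeline: the function name `bag_of_words_caption`, the tokenization rule, and the tiny `STOPWORDS` list are all assumptions, and the word balancing step mentioned in the abstract is not modeled.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "with", "is", "at"}


def bag_of_words_caption(caption: str) -> str:
    """Reduce a caption to its sorted, de-duplicated content words.

    Word order and grammar are discarded; only the vocabulary survives.
    """
    tokens = re.findall(r"[a-z0-9]+", caption.lower())
    words = sorted({t for t in tokens if t not in STOPWORDS})
    return " ".join(words)


if __name__ == "__main__":
    # prints: ball beach dog running
    print(bag_of_words_caption("A dog is running on the beach with a ball"))
```

Sorting is used here only to make the output deterministic; any ordering would serve, since a bag of words carries no order.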


“…Vision-Language Pretraining: Joint vision-language pretraining (VLP) is an active research area [1,19,39,63,68] where the availability of large-scale image-text datasets e.g., YFCC100M [71] and Conceptual Captions [9,67] has played a key role in its progress. Although multiple concurrent works are being proposed to further improve VLP models [75], our work is different from them in a few important ways. Specifically, unlike EfficientCLIP [75] that proposes an ensemble approach to obtain a less noisy data subset for cross-modal training, our method attempts to sidestep this problem altogether by re-purposing as opposed to completely removing noisy data.…”

Section: Related Work (mentioning)

confidence: 99%

“…Although multiple concurrent works are being proposed to further improve VLP models [75], our work is different from them in a few important ways. Specifically, unlike EfficientCLIP [75] that proposes an ensemble approach to obtain a less noisy data subset for cross-modal training, our method attempts to sidestep this problem altogether by re-purposing as opposed to completely removing noisy data. Similarly, DeCLIP [39] improves on the data-efficiency of CLIP [63] by leveraging intra-model contrastive learning along with a nearest-neighbor feature bank to augment negatives.…”

Section: Related Work (mentioning)

confidence: 99%

“…The convergence of self-supervised pretraining techniques in natural language processing and computer vision have brought about a renaissance of cross-modal representation learning methods [1,19,30,39,52,63,68,75] where large-scale weakly correlated multimodal data (e.g., image-text pairs) is used to learn cross-modal representations using contrastive learning techniques. In particular, the recently proposed CLIP [63] model has garnered significant attention due to its impressive zero-shot recognition ability and excellent transfer performance on downstream tasks.…”

Section: Introduction (mentioning)

confidence: 99%

“…Specifically, we propose a simple yet effective framework for robust contrastive language-image pretraining that uses progressive self-distillation and soft image-text alignment targets to more efficiently learn from noisy data. Instead of explicitly finding, correcting or even pruning noisy correspondences [75,88], our joint student-teacher model dynamically generates a new set of soft-alignments for a random subset of images and captions in every minibatch. This enables our method to model many-to-many relationships while simultaneously re-calibrating potentially poorly matched instances without needing to identify them.…”

Section: Introduction (mentioning)

confidence: 99%


Robust Cross-Modal Representation Learning with Progressive Self-Distillation

Andonian, Chen, Hamid 2022

Preprint


The learning objective of vision-language approach of CLIP [63] does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To address this challenge, we introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data. Our model distills its own knowledge to dynamically generate soft-alignment targets for a subset of images and captions in every minibatch, which are then used to update its parameters. Extensive evaluation across 14 benchmark datasets shows that our method consistently outperforms its CLIP counterpart in multiple settings, including: (a) zero-shot classification, (b) linear probe transfer, and (c) image-text retrieval, without incurring extra computational cost. Analysis using an ImageNet-based robustness test-bed [70] reveals that our method offers better effective robustness to natural distribution shifts compared to both ImageNet-trained models and CLIP itself. Lastly, pretraining with datasets spanning two orders of magnitude in size shows that our improvements over CLIP tend to scale with number of training examples.
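The soft-alignment idea described in this abstract can be sketched in a few lines of plain Python. This is a hedged illustration rather than the authors' implementation: it assumes the soft target for each image is a convex blend of the usual one-hot target and a teacher's softmax over image-text similarities, with a hypothetical mixing weight `alpha`, and it applies the blend to every row instead of a random subset of the minibatch. The function names are invented for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of similarity scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(teacher_sims, alpha):
    """Blend one-hot targets with teacher predictions.

    teacher_sims[i][j] is the teacher's similarity between image i and
    caption j. alpha weights the one-hot part (assumed mixing rule).
    """
    targets = []
    for i, row in enumerate(teacher_sims):
        probs = softmax(row)
        targets.append([
            alpha * (1.0 if j == i else 0.0) + (1.0 - alpha) * p
            for j, p in enumerate(probs)
        ])
    return targets


def soft_cross_entropy(student_sims, targets):
    """Mean cross-entropy between student softmax rows and soft targets."""
    loss = 0.0
    for row, tgt in zip(student_sims, targets):
        probs = softmax(row)
        loss -= sum(t * math.log(p) for t, p in zip(tgt, probs))
    return loss / len(student_sims)


if __name__ == "__main__":
    teacher = [[2.0, 0.0], [0.0, 2.0]]
    student = [[1.0, 0.5], [0.2, 1.5]]
    targets = soft_targets(teacher, alpha=0.5)
    print(soft_cross_entropy(student, targets))
```

With `alpha = 1.0` the targets collapse to the one-hot labels of the standard contrastive objective; lowering `alpha` shifts weight toward the teacher's many-to-many estimates.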


The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis

Barraco, Cornia, Cascianelli et al. 2022

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)


Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provide a strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely-used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.
