Proceedings of the 25th Conference on Computational Natural Language Learning 2021
DOI: 10.18653/v1/2021.conll-1.13

Does language help generalization in vision models?

Abstract: Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness…

Cited by 10 publications (14 citation statements). References 11 publications.
“…In contrast, our work studies the robustness of CLIP and how language specifically affects its capability to generalize out of distribution. An important difference between our experiments and those of Devillers et al. [12] is that we control for in-distribution accuracy in our comparison between the models, to separate accuracy from robustness. Furthermore, Andreassen et al. [1] study the effect of fine-tuning on robustness…”
Section: Additional Related Work
confidence: 97%
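The comparison described in this excerpt turns on separating in-distribution accuracy from robustness. A minimal sketch of one common way to do that (comparing each model's out-of-distribution accuracy against a trend fitted to baseline models, so models are judged at matched in-distribution accuracy) is given below; the model names and accuracy numbers are made up for illustration and are not results from the cited papers.

```python
import numpy as np

# Hypothetical (made-up) evaluation results: (in-distribution acc, out-of-distribution acc)
baseline_models = {
    "resnet50-supervised": (0.76, 0.41),
    "vit-supervised":      (0.79, 0.45),
    "simclr-resnet50":     (0.73, 0.38),
}
candidate_models = {
    "clip-resnet50": (0.75, 0.52),
}

# Fit a linear baseline trend ood_acc ~ a * id_acc + b using only the baseline models.
id_acc = np.array([v[0] for v in baseline_models.values()])
ood_acc = np.array([v[1] for v in baseline_models.values()])
a, b = np.polyfit(id_acc, ood_acc, deg=1)

# Report how far each candidate sits above the trend at its own ID accuracy,
# i.e. OOD accuracy beyond what matched in-distribution accuracy would predict.
for name, (idv, oodv) in candidate_models.items():
    expected = a * idv + b
    print(f"{name}: OOD {oodv:.2f} vs expected {expected:.2f} "
          f"(gap above trend {oodv - expected:+.2f})")
```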
“…Related recent work also studies exactly where the generalization capabilities of CLIP come from. Devillers et al. [12] investigate whether models that use multimodal information (such as text and images) have superior generalization capabilities, as measured by few-shot and linear-probe performance, compared to models that use only one type of information (images or text). Their analysis found that, in both the few-shot and linear-probe settings, there was no consistent advantage of multimodal models over models using only a single modality…”
Section: Additional Related Work
confidence: 99%
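The few-shot and linear-probe evaluations mentioned in this excerpt both amount to training a small classifier on top of frozen encoder features. A minimal linear-probe sketch is given below; the encoder is a dummy random projection standing in for any vision or multimodal image encoder, and the data are random placeholders, not the benchmarks used by Devillers et al.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(encode, train_images, train_labels, test_images, test_labels):
    """Fit a linear classifier on frozen features from `encode` and report test accuracy."""
    X_train = np.stack([encode(img) for img in train_images])
    X_test = np.stack([encode(img) for img in test_images])
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, train_labels)
    return accuracy_score(test_labels, probe.predict(X_test))

# Usage sketch with a dummy encoder (random projection of flattened pixels).
rng = np.random.default_rng(0)
proj = rng.normal(size=(32 * 32 * 3, 128))
dummy_encode = lambda img: img.reshape(-1) @ proj

images = rng.normal(size=(200, 32, 32, 3))
labels = rng.integers(0, 10, size=200)
acc = linear_probe(dummy_encode, images[:150], labels[:150], images[150:], labels[150:])
print(f"linear-probe accuracy: {acc:.2f}")
```

A few-shot variant is the same procedure with only a handful of labeled examples per class in the training split.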
“…Comparing our reimplementation of CLIP and CLOOB with ResNet-50 encoders, we observe mixed results. This effect might be attributed to the observed task-dependence of multimodal models (Devillers et al., 2021). Another potential reason is that the benefit of restricting learning to the more reliable patterns that occur in both modalities does not directly translate to an evaluation of the encoder for just one modality…”
Section: A3.3 Datasets
confidence: 99%
“…The CLIP model has guided generative models via an additional training objective (Bau et al., 2021; Galatolo et al., 2021; Frans et al., 2021) and improved clustering of latent representations (Pakhomov et al., 2021). It is used in studies of out-of-distribution performance (Devillers et al., 2021; Milbich et al., 2021), of fine-tuning robustness, of zero-shot prompts, and of adversarial attacks on uncurated datasets (Carlini & Terzis, 2021). It has stirred discussions about more holistic evaluation schemes in computer vision (Agarwal et al., 2021)…”
Section: A4 Brief Review Of Modern Hopfield Networks
confidence: 99%
“…Unfortunately, this does not always happen in practice. Recently, Devillers et al. (2021) evaluated the visual generalization abilities of CLIP (Radford et al., 2021), a popular network trained with a contrastive learning objective on more than 400M image-caption pairs scraped from the web, and of other multimodal models (Sariyildiz et al., 2020; Desai and Johnson, 2020). They showed that for standard object classification tasks (e.g.…
Section: Introduction
confidence: 99%
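The contrastive learning objective mentioned in this excerpt pairs each image in a batch with its caption and treats all other captions in the batch as negatives. The sketch below follows the published description of a CLIP-style symmetric image-text loss, but it is a simplified illustration rather than the authors' implementation; the embedding size and temperature are arbitrary.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.
    Matching pairs sit on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for encoder outputs:
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt))
```

Normalizing the embeddings makes the dot products cosine similarities, and the temperature controls how sharply matching pairs are favored over the in-batch negatives.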