Proceedings of the 25th Conference on Computational Natural Language Learning 2021
DOI: 10.18653/v1/2021.conll-1.13

Does language help generalization in vision models?

Abstract: Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness…

Cited by 10 publications (14 citation statements). References 11 publications.
“…In contrast, our work studies the robustness of CLIP and how language specifically affects its capability to generalize out of distribution. An important difference between our experiments and those of Devillers et al. [12] is that we control for in-distribution accuracy in our comparison between the models, to separate accuracy from robustness. Furthermore, Andreassen et al. [1] study the effect of fine-tuning on robustness…”
Section: Additional Related Work
confidence: 97%
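The comparison described in this excerpt turns on separating in-distribution accuracy from robustness. A minimal sketch of one common way to do that (comparing each model's out-of-distribution accuracy against a trend fitted to baseline models, so models are judged at matched in-distribution accuracy) is given below; the model names and accuracy numbers are made up for illustration and are not results from the cited papers.

```python
import numpy as np

# Hypothetical (made-up) evaluation results: (in-distribution acc, out-of-distribution acc)
baseline_models = {
    "resnet50-supervised": (0.76, 0.41),
    "vit-supervised":      (0.79, 0.45),
    "simclr-resnet50":     (0.73, 0.38),
}
candidate_models = {
    "clip-resnet50": (0.75, 0.52),
}

# Fit a linear baseline trend ood_acc ~ a * id_acc + b using only the baseline models.
id_acc = np.array([v[0] for v in baseline_models.values()])
ood_acc = np.array([v[1] for v in baseline_models.values()])
a, b = np.polyfit(id_acc, ood_acc, deg=1)

# Report how far each candidate sits above the trend at its own ID accuracy,
# i.e. OOD accuracy beyond what matched in-distribution accuracy would predict.
for name, (idv, oodv) in candidate_models.items():
    expected = a * idv + b
    print(f"{name}: OOD {oodv:.2f} vs expected {expected:.2f} "
          f"(gap above trend {oodv - expected:+.2f})")
```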
“…Related recent work also studies exactly where the generalization capabilities of CLIP come from. Devillers et al. [12] investigate whether models that use multimodal information (such as text and images) have superior generalization capabilities, as measured by few-shot and linear-probe performance, compared to models that use only one type of information (images or text). Their analysis found that, in both the few-shot and linear-probe settings, there was no consistent advantage of multimodal models over models using only a single modality…”
Section: Additional Related Work
confidence: 99%
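The few-shot and linear-probe evaluations mentioned in this excerpt both amount to training a small classifier on top of frozen encoder features. A minimal linear-probe sketch is given below; the encoder is a dummy random projection standing in for any vision or multimodal image encoder, and the data are random placeholders, not the benchmarks used by Devillers et al.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(encode, train_images, train_labels, test_images, test_labels):
    """Fit a linear classifier on frozen features from `encode` and report test accuracy."""
    X_train = np.stack([encode(img) for img in train_images])
    X_test = np.stack([encode(img) for img in test_images])
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, train_labels)
    return accuracy_score(test_labels, probe.predict(X_test))

# Usage sketch with a dummy encoder (random projection of flattened pixels).
rng = np.random.default_rng(0)
proj = rng.normal(size=(32 * 32 * 3, 128))
dummy_encode = lambda img: img.reshape(-1) @ proj

images = rng.normal(size=(200, 32, 32, 3))
labels = rng.integers(0, 10, size=200)
acc = linear_probe(dummy_encode, images[:150], labels[:150], images[150:], labels[150:])
print(f"linear-probe accuracy: {acc:.2f}")
```

A few-shot variant is the same procedure with only a handful of labeled examples per class in the training split.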
“…Comparing our reimplementation of CLIP and CLOOB with ResNet-50 encoders, we observe mixed results. This effect might be attributed to the observed task-dependence of multimodal models (Devillers et al., 2021). Another potential reason is that the benefit of restricting learning to the more reliable patterns that occur in both modalities does not directly translate to an evaluation of the encoder for just one modality…”
Section: A3.3 Datasets
confidence: 99%
“…The CLIP model has guided generative models via an additional training objective (Bau et al., 2021; Galatolo et al., 2021; Frans et al., 2021) and improved clustering of latent representations (Pakhomov et al., 2021). It is used in studies of out-of-distribution performance (Devillers et al., 2021; Milbich et al., 2021), of fine-tuning robustness, of zero-shot prompts, and of adversarial attacks on uncurated datasets (Carlini & Terzis, 2021). It has stirred discussions about more holistic evaluation schemes in computer vision (Agarwal et al., 2021)…”
Section: A4 Brief Review Of Modern Hopfield Networks
confidence: 99%
“…Unfortunately, this does not always happen in practice. Recently, Devillers et al. (2021) evaluated the visual generalization abilities of CLIP (Radford et al., 2021), a popular network trained with a contrastive learning objective on more than 400M image-caption pairs scraped from the web, and of other multimodal models (Sariyildiz et al., 2020; Desai and Johnson, 2020). They showed that for standard object classification tasks (e.g.…
Section: Introduction
confidence: 99%
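The contrastive learning objective mentioned in this excerpt pairs each image in a batch with its caption and treats all other captions in the batch as negatives. The sketch below follows the published description of a CLIP-style symmetric image-text loss, but it is a simplified illustration rather than the authors' implementation; the embedding size and temperature are arbitrary.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.
    Matching pairs sit on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for encoder outputs:
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt))
```

Normalizing the embeddings makes the dot products cosine similarities, and the temperature controls how sharply matching pairs are favored over the in-batch negatives.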