2022
DOI: 10.48550/arxiv.2212.10537
Preprint
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

Cited by 2 publications (2 citation statements)
References 0 publications
“…foreground objects, lack compositionality, and do not understand concepts of negation. Research efforts mitigating these shortcomings [73,74] are ripe for exploration. Third, we anticipate ConceptFusion to inherit the limitations and biases of foundation models [5,75], warranting further investigations for potential harm as well as research into AI safety and alignment [76,77].…”
Section: Discussion
Confidence: 99%
“…A subsequent work (Diwan et al., 2022) shows that Winoground requires not only compositional language understanding but also other abilities, such as sophisticated commonsense reasoning and locating small objects in low-resolution images, which most vision and language models currently lack. The work of Lewis et al. (2023) is the most relevant to our research, although it primarily deals with toy datasets. Our work also reveals the brittleness of vision-language models through the lens of CAB, which has been overlooked in the past.…”
Section: Compositionality in Vision and Language Models
Confidence: 99%