Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Thrush, Tristan; Jiang, Ruifen; Bartolo, Max; Singh, Amanpreet; Williams, Adina; Ross, Candace

doi:10.1109/cvpr52688.2022.00517

Cited by 87 publications

(83 citation statements)

References 36 publications

“…Recent research suggests that CLIP's compositional capabilities are limited [75]. As shown by our results, restricted domains allow for direct manipulation, without the risk of confounding; indeed, restricted domains may be easier to explore but further investigation is needed to confirm compositional capabilities.…”

Section: Grounding and Compositionality (mentioning)

confidence: 99%

Contrastive language and vision learning of general fashion concepts

Chia, Attanasio, Bianchi et al. 2022

Sci Rep

The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from general and transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model adapted for the fashion industry. We demonstrate the effectiveness of the representations learned by FashionCLIP with extensive tests across a variety of tasks, datasets and generalization probes. We argue that adaptations of large pre-trained models such as CLIP offer new perspectives in terms of scalability and sustainability for certain types of players in the industry. Finally, we detail the costs and environmental impact of training, and release the model weights and code as open source contribution to the community.

“…Visual-linguistic compositionality has been explored for image-language models [66,99,120,126]. The compositional nature of language allows the evaluation of various aspects: meaning change due to change in word order [99], relationship between objects [126], systematicity and productivity [66], etc.…”

Section: Time In Vision (mentioning)

confidence: 99%

Test of Time: Instilling Video-Language Models with a Sense of Time

Bagad, Tapaswi, Snoek 2023

Preprint

Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.

“…Recent research has started to probe VLMs' for such information. Thrush et al (2022) proposed Winoground, a dataset of hand-curated test cases that document a clear lack of compositional and pragmatic understanding in VLMs. The dataset is high quality but relatively small scale; its 400 test cases cover a wide range of linguistic phenomena (e.g., relation, attribution, pragmatics, world knowledge), making it hard to render statistically significant results about relational and attributive abilities.…”

Section: Attribution Relation and Order (Aro) Benchmark: When Do Mode... (mentioning)

confidence: 99%

“…Parcalabescu et al (2021) show that VLMs have difficulties in counting objects in images. In terms of the evaluation part of our paper, Winoground (Thrush et al, 2022) presents the nearest neighbor to our work. Winoground is a carefully curated dataset that aims to evaluate compositional and pragmatics language understanding of VLMs.…”

Section: Related Work (mentioning)

confidence: 99%

“…Natural scenes are complex, composed of many objects and attributes, in relationships with one another. While there have been important efforts to test compositional representations of objects, attributes, and relations (Thrush et al, 2022), such efforts are based on small sets of hand-crafted examples, often combined with testing many other types of knowledge. This makes it hard to evaluate the role of relational and attributional knowledge in isolation and lacks the statistical power to quantify how well VLMs perform on granular subtypes of compositions.…”

Section: Introduction (mentioning)

confidence: 99%

When and why vision-language models behave like bags-of-words, and what to do about it?

Yüksekgönül, Bianchi, Kalluri et al. 2022

Preprint

Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode the compositional relationships between objects and attributes. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We present the settings where state-of-the-art VLMs behave like bags-of-words, i.e., when they have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large scale datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on image-text retrieval over existing datasets without using the composition and order information. This further motivates the value of using ARO to benchmark VLMs. Given that contrastive pretraining optimizes for retrieval on large datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.
