Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.390

Visual Commonsense in Pretrained Unimodal and Multimodal Models

Abstract: Our commonsense knowledge about objects includes their typical visual attributes; we know that bananas are typically yellow or green, and not purple. Text and image corpora, being subject to reporting bias, represent this world knowledge to varying degrees of faithfulness. In this paper, we investigate to what degree unimodal (language-only) and multimodal (image and language) models capture a broad range of visually salient attributes. To that end, we create the Visual Commonsense Tests (ViComTe) dataset cover…

Cited by 5 publications (9 citation statements)
References 4 publications
“…Alternatively, information about event typicality might enter LLMs through input from different modalities, such as visual depictions of the world in the form of large databases of images and/or image descriptions (Bisk et al., 2020). Distributional models trained on multimodal data have indeed been shown to outperform text‐only trained models in overcoming the reporting bias for visual concept knowledge (e.g., Paik et al., 2021; Zhang et al., 2022). In the future, we plan to extend our analysis of GEK to multimodal LLMs (e.g., CLIP; Radford et al., 2021) in order to investigate the role of extralinguistic evidence, which might reduce the impact of the reporting bias and better simulate the multimodal information that humans use to acquire GEK.…”
Section: Discussion
confidence: 99%
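For intuition only, here is a minimal text-only sketch of how one might probe a CLIP-style multimodal model for a typical object color. The prompt templates, the comparison against a neutral prompt, and the choice of checkpoint are assumptions for illustration, not the probing protocol used in Paik et al. (2021) or Zhang et al. (2022).

```python
# Illustrative text-only probe of CLIP's text encoder for a color attribute.
# Assumes the HuggingFace `transformers` and `torch` packages are installed.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

colors = ["yellow", "green", "purple", "red", "blue"]
prompts = [f"a photo of a {c} banana" for c in colors]

with torch.no_grad():
    attr_feats = model.get_text_features(
        **tokenizer(prompts, padding=True, return_tensors="pt")
    )
    # Neutral prompt for the object without any color attribute.
    obj_feat = model.get_text_features(
        **tokenizer(["a photo of a banana"], return_tensors="pt")
    )

# Higher similarity to the neutral prompt is taken (heuristically) as a
# signal that the color is more typical for the object.
sims = torch.nn.functional.cosine_similarity(attr_feats, obj_feat)
for color, score in sorted(zip(colors, sims.tolist()), key=lambda x: -x[1]):
    print(f"{color}: {score:.3f}")
```

The ranking heuristic here (cosine similarity between attributed and neutral prompts) is only one of several plausible scoring schemes; the cited works may score candidates differently.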
“…We find that (i) regarding visual concepts, both OPT and CLIP-like models perform closely to human annotators. CLIP and DeCLIP even outperform the human annotators on the shape task, which is potentially due to the noise introduced by the automatic construction of the dataset (Zhang et al., 2022a). Instruction tuning enhances proficiency in both visual and embodied concepts. After post-training with the instruction tuning dataset, Vicuna models display enhanced proficiency in both visual and embodied concepts, with larger LLMs demonstrating a more significant improvement.…”
Section: Main Findings
confidence: 98%
“…In this work, we consider evaluating the visual understanding ability of LMs by examining their performance on various visual concepts. Specifically, we combine the recently proposed visual knowledge probing datasets, including Spatial Commonsense (Liu et al., 2022a) and ViComTe (Zhang et al., 2022a). The combined dataset requires not only understanding various generic visual concepts including color, shape, and material, but also understanding the relationship between common objects, such as size and height.…”
Section: Visual Concepts
confidence: 99%
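As a companion illustration, a cloze-style probe of a text-only masked language model over color, shape, and material templates might look like the sketch below. The templates and the choice of `bert-base-uncased` are assumptions for illustration and are not necessarily the prompts or models used in ViComTe or Spatial Commonsense.

```python
# Illustrative cloze-style probe of a text-only masked LM for visual attributes.
# Assumes the HuggingFace `transformers` package is installed.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "Most bananas are [MASK].",          # color
    "The shape of a coin is [MASK].",    # shape
    "A table is usually made of [MASK].",  # material
]

for template in templates:
    predictions = fill(template, top_k=5)
    print(template, "->", [p["token_str"] for p in predictions])
```

Comparing such top-k predictions against human-annotated typical attributes is one simple way to quantify how much visual commonsense a text-only model has absorbed despite reporting bias.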