Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.390

Visual Commonsense in Pretrained Unimodal and Multimodal Models

Abstract: Our commonsense knowledge about objects includes their typical visual attributes; we know that bananas are typically yellow or green, and not purple. Text and image corpora, being subject to reporting bias, represent this world knowledge to varying degrees of faithfulness. In this paper, we investigate to what degree unimodal (language-only) and multimodal (image and language) models capture a broad range of visually salient attributes. To that end, we create the Visual Commonsense Tests (ViComTe) dataset cover…

Cited by 5 publications (9 citation statements)
References 4 publications
“…Alternatively, information about event typicality might enter LLMs through input from different modalities, such as visual depictions of the world in the form of large databases of images and/or image descriptions (Bisk et al., 2020). Distributional models trained on multimodal data have indeed been shown to outperform text‐only trained models in overcoming the reporting bias for visual concept knowledge (e.g., Paik et al., 2021; Zhang et al., 2022). In the future, we plan to extend our analysis of GEK to multimodal LLMs (e.g., CLIP; Radford et al., 2021) in order to investigate the role of extralinguistic evidence, which might reduce the impact of the reporting bias and better simulate the multimodal information that humans use to acquire GEK.…”
Section: Discussion
confidence: 99%
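For intuition only, here is a minimal text-only sketch of how one might probe a CLIP-style multimodal model for a typical object color. The prompt templates, the comparison against a neutral prompt, and the choice of checkpoint are assumptions for illustration, not the probing protocol used in Paik et al. (2021) or Zhang et al. (2022).

```python
# Illustrative text-only probe of CLIP's text encoder for a color attribute.
# Assumes the HuggingFace `transformers` and `torch` packages are installed.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

colors = ["yellow", "green", "purple", "red", "blue"]
prompts = [f"a photo of a {c} banana" for c in colors]

with torch.no_grad():
    attr_feats = model.get_text_features(
        **tokenizer(prompts, padding=True, return_tensors="pt")
    )
    # Neutral prompt for the object without any color attribute.
    obj_feat = model.get_text_features(
        **tokenizer(["a photo of a banana"], return_tensors="pt")
    )

# Higher similarity to the neutral prompt is taken (heuristically) as a
# signal that the color is more typical for the object.
sims = torch.nn.functional.cosine_similarity(attr_feats, obj_feat)
for color, score in sorted(zip(colors, sims.tolist()), key=lambda x: -x[1]):
    print(f"{color}: {score:.3f}")
```

The ranking heuristic here (cosine similarity between attributed and neutral prompts) is only one of several plausible scoring schemes; the cited works may score candidates differently.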
“…We find that (i) regarding visual concepts, both OPT and CLIP-like models perform closely to human annotators. CLIP and DeCLIP even outperform the human annotators on the shape task, which is potentially due to the noise introduced by the automatic construction of the dataset (Zhang et al., 2022a). Instruction tuning enhances proficiency in both visual and embodied concepts. After post-training with the instruction tuning dataset, Vicuna models display enhanced proficiency in both visual and embodied concepts, with larger LLMs demonstrating a more significant improvement.…”
Section: Main Findings
confidence: 98%
“…In this work, we consider evaluating the visual understanding ability of LMs by examining their performance on various visual concepts. Specifically, we combine the recently proposed visual knowledge probing datasets, including Spatial Commonsense (Liu et al., 2022a) and ViComTe (Zhang et al., 2022a). The combined dataset requires not only understanding various generic visual concepts including color, shape, and material, but also understanding the relationship between common objects, such as size and height.…”
Section: Visual Concepts
confidence: 99%
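As a companion illustration, a cloze-style probe of a text-only masked language model over color, shape, and material templates might look like the sketch below. The templates and the choice of `bert-base-uncased` are assumptions for illustration and are not necessarily the prompts or models used in ViComTe or Spatial Commonsense.

```python
# Illustrative cloze-style probe of a text-only masked LM for visual attributes.
# Assumes the HuggingFace `transformers` package is installed.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "Most bananas are [MASK].",          # color
    "The shape of a coin is [MASK].",    # shape
    "A table is usually made of [MASK].",  # material
]

for template in templates:
    predictions = fill(template, top_k=5)
    print(template, "->", [p["token_str"] for p in predictions])
```

Comparing such top-k predictions against human-annotated typical attributes is one simple way to quantify how much visual commonsense a text-only model has absorbed despite reporting bias.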