2022
DOI: 10.48550/arxiv.2204.00598
Preprint

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Abstract: Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., from spreadsheets to SAT questions). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show …

Cited by 44 publications (60 citation statements)
References 59 publications (96 reference statements)
“…They instead explore conditioning the frozen language model by grafting new learnt layers to the frozen language model. Finally, PICa and Socratic Models (Zeng et al., 2022) propose to use off-the-shelf vision-language models (Radford et al., 2021) to communicate the content of images using language descriptions to GPT-3 (Brown et al., 2020).…”
Section: Joint Vision and Language Modelling
confidence: 99%
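The mechanism described in the quote above, an off-the-shelf VLM rendering the image as text that is then spliced into a text-only LM prompt, can be sketched in a few lines of Python. The BLIP captioning checkpoint and the `llm_complete` stub below are illustrative assumptions, not the exact models used in the cited works.

```python
# Minimal sketch of the VLM-to-LM handoff described above. The caption model
# and the llm_complete stub are assumptions, not the cited papers' exact setup.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def llm_complete(prompt: str) -> str:
    # Placeholder for a GPT-3-style completion call; swap in any large LM here.
    raise NotImplementedError

def answer_about_image(image_path: str, question: str) -> str:
    # 1) Communicate the image content in language via the VLM.
    caption = captioner(image_path)[0]["generated_text"]
    # 2) Substitute that description into the language-model prompt.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return llm_complete(prompt)
```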
“…Inner Monologue [18] expands LLM planning by incorporating outputs from success detectors or other visual language models and uses their feedback to re-plan. Socratic Models [16] uses visual language models to substitute perceptual information (in teal) into the language prompts that generate plans, and it uses language-conditioned policies e.g., for grasping [36]. The following example illustrates the qualitative differences between our approach versus the aforementioned prior works.…”
Section: Perception APIs Control APIs
confidence: 99%
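A similar substitution drives the plan generation described in this quote: perception results are written into the prompt, and the generated steps call a language-conditioned policy. The sketch below is a loose illustration of that pattern; `detect_objects`, `llm_plan`, and `grasp` are hypothetical placeholders, not APIs from the cited systems.

```python
# Hypothetical perception and control interfaces, used only for illustration.
from typing import List

def detect_objects(scene_image) -> List[str]:
    """Hypothetical perception API: names of objects visible in the scene."""
    ...

def grasp(object_name: str) -> None:
    """Hypothetical language-conditioned control API, e.g. a grasping policy."""
    ...

def llm_plan(prompt: str) -> List[str]:
    """Hypothetical LLM call that returns one plan step per list element."""
    ...

def plan_and_act(scene_image, instruction: str) -> None:
    objects = detect_objects(scene_image)  # perceptual information ...
    prompt = (
        f"Objects in the scene: {', '.join(objects)}\n"  # ... substituted into the prompt
        f"Instruction: {instruction}\n"
        "Plan, one 'grasp <object>' step per line:"
    )
    for step in llm_plan(prompt):
        grasp(step.removeprefix("grasp").strip())
```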
“…Meanwhile, recent progress in natural language processing shows that large language models (LLMs) pretrained on Internet-scale data [11]-[13] exhibit out-of-the-box capabilities [14]-[16] that can be applied to language-using robots, e.g., planning a sequence of steps from natural language instructions [16]-[18] without additional model finetuning. These steps can be grounded in real robot affordances from value functions among a fixed set of skills, i.e., policies pretrained with behavior cloning or reinforcement learning [19]-[21].…”
Section: Introduction
confidence: 99%
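The grounding step mentioned here, scoring each skill in a fixed set with both the language model and a pretrained value function, can be sketched as below. The product-of-scores combination (computed in log space) and the interfaces are assumptions for illustration, not the cited method verbatim.

```python
# Sketch: pick the next skill by combining LM likelihood with an affordance value.
import math
from typing import Callable, Dict

def choose_next_skill(
    context: str,                               # instruction plus steps executed so far
    skill_values: Dict[str, float],             # skill name -> value-function estimate in [0, 1]
    llm_log_prob: Callable[[str, str], float],  # (context, skill text) -> LM log-likelihood
) -> str:
    best_skill, best_score = "", float("-inf")
    for skill, value in skill_values.items():
        # Product of LM probability and value function, taken in log space.
        score = llm_log_prob(context, skill) + math.log(max(value, 1e-8))
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```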
“…Those models are capable of performing image captioning in a zero-shot manner. A concurrent work, Socratic Models (SM, Zeng et al. (2022)), also uses textual data to bridge the domain gap between vision-language models and language models. The model, however, is stronger in retrieval tasks than in captioning tasks, as we will show later.…”
Section: Constrained Text Generation
confidence: 99%
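The retrieval-style strength attributed to SM in this quote can be illustrated with CLIP ranking a fixed pool of candidate sentences against the image, rather than decoding a caption token by token; the checkpoint and candidate pool below are assumptions for the sketch.

```python
# Sketch: zero-shot caption retrieval with CLIP, scoring candidates against the image.
from typing import List
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_caption(image_path: str, candidates: List[str]) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image  # shape: (1, num_candidates)
    return candidates[scores.argmax(dim=-1).item()]
```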
“…With the domain specified to VG, the performance is further boosted. Figure 4 shows an example with an image cropped directly from the Socratic Models paper (Zeng et al., 2022). We find that caption generation does not require the complex prompt used in Socratic Models.…”
Section: Best-VG Domain
confidence: 99%