Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
Preprint, 2022. DOI: 10.48550/arxiv.2201.07207

Abstract: Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge"). While prior work focused on learning from explicit step-by-step examples of how to act, we surprisingly find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks…
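The abstract's recipe can be pictured concretely. Below is a minimal sketch, not the authors' code: a few-shot prompt asks an LM to decompose a task into steps, and each free-form step is then grounded to the closest admissible action by embedding similarity using the sentence-transformers library. The llm_complete stub, the PROMPT text, and the ADMISSIBLE_ACTIONS list are illustrative assumptions; the stub is hard-coded so the sketch runs end to end.

import numpy as np
from sentence_transformers import SentenceTransformer

ADMISSIBLE_ACTIONS = ["open fridge", "grab egg", "turn on stove", "close fridge"]

PROMPT = (
    "Task: make coffee\n"
    "Step 1: walk to kitchen\n"
    "Step 2: grab mug\n\n"
    "Task: make breakfast\n"
    "Step 1:"
)

def llm_complete(prompt: str) -> str:
    # Placeholder for any autoregressive LM completion API; hard-coded
    # here so the sketch runs without network access.
    return " open fridge\nStep 2: grab egg\nStep 3: turn on stove"

def generate_steps(prompt: str) -> list[str]:
    # Parse "Step n: <action>" lines out of the raw completion.
    lines = llm_complete(prompt).splitlines()
    return [ln.split(":", 1)[-1].strip() for ln in lines if ln.strip()]

def ground(step: str, encoder, action_embs) -> str:
    # Map a free-form step to the nearest admissible action (cosine similarity).
    e = encoder.encode([step])[0]
    sims = action_embs @ e / (np.linalg.norm(action_embs, axis=1) * np.linalg.norm(e))
    return ADMISSIBLE_ACTIONS[int(np.argmax(sims))]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
action_embs = encoder.encode(ADMISSIBLE_ACTIONS)
plan = [ground(s, encoder, action_embs) for s in generate_steps(PROMPT)]
print(plan)  # e.g. ['open fridge', 'grab egg', 'turn on stove']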

Cited by 24 publications (44 citation statements: 0 supporting, 44 mentioning, 0 contrasting). References 38 publications.

Citation statements (ordered by relevance):
“…This is particularly true for natural language generation problems that require careful planning, such as natural language proofs or paper-writing. The domains need not be constrained to language either; tasks leveraging language models in other domains are a natural extension, such as visual question answering [Fang et al., 2015] and language-guided reinforcement learning [Mu et al., 2022; Huang et al., 2022]. We are excited about this avenue of future work.…”
Section: Discussion (mentioning)
confidence: 99%
“…re-frame "video Q&A" as a "short story Q&A" problem, where the language-based world-state history can be viewed as an interpretable state representation in the form of short stories, which can be used for reading-comprehension Q&A, for which LMs have demonstrated strong zero-shot performance (Brown et al. 2020). Drawing analogies to 3D vision and robotics, this can be thought of as building an on-the-fly reconstruction of the sequence of events in the observable world with language, rather than other representations such as dynamically-updated 3D meshes (Izadi et al. 2011) or neural fields (Tancik et al. 2022). In addition to open-ended question-answering in the form of text, we can also provide video search capabilities (in the form of image or audio retrieval, Fig 7) from natural language questions through zero-shot composition of SMs as well.…”
Section: System Overview: Socratic Egocentric Perception (mentioning)
confidence: 99%
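As a rough illustration of the idea quoted above (not the cited system's implementation), the sketch below renders a timestamped, language-based world-state history as a "short story" and builds a reading-comprehension prompt from it. The Event, to_story, and qa_prompt names are hypothetical, and the captions stand in for outputs of a vision-language model.

from dataclasses import dataclass

@dataclass
class Event:
    time: str     # timestamp of the observation
    caption: str  # stub for a vision-language model's description

def to_story(events: list[Event]) -> str:
    # The interpretable "world-state history": events narrated as text.
    return "\n".join(f"At {e.time}, I {e.caption}." for e in events)

def qa_prompt(events: list[Event], question: str) -> str:
    # Any LM completion API can consume this reading-comprehension prompt.
    return f"{to_story(events)}\n\nQ: {question}\nA:"

events = [Event("9:02", "opened the fridge"), Event("9:05", "poured coffee")]
print(qa_prompt(events, "When did I last open the fridge?"))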
“…Foundation models (Bommasani et al. 2021) (e.g., BERT, GPT-3, CLIP) have enabled impressive capabilities in recent years: from zero-shot image classification (Radford et al. 2021; Li et al. 2021a) to high-level planning (Huang et al. 2022; Ahn et al. 2022). These capabilities depend on their training data distribution, and while the data may be generic or indiscriminately crawled from the web, its distribution remains distinct across domains.…”
Section: Introduction (mentioning)
confidence: 99%
“…Natural language is an ideal candidate, given that interfaces such as mouse-and-keyboard, touchscreens, and programming languages are powerful but require extensive training for proper usage [22]. Multiple facets of language-based human-robot interaction have been studied in the literature, such as instruction understanding [23,24], motion plan generation [9,12,16,25], human-robot cooperation [26], semantic belief propagation [18,19], and visual language navigation [11,27]. Most recent works in the field have shifted from representing language in terms of classical grammatical structure towards data-driven techniques, due to the higher flexibility of such knowledge representations [22].…”
Section: Related Work (mentioning)
confidence: 99%
“…and task learning that can generalize across multiple environments (How should I do it?). Recent works have only just started to explore the use of pre-existing foundational models from language and vision in robotics [9]–[13], as well as the development of robotics-specific foundational models [8,14].…”
Section: Introduction (mentioning)
confidence: 99%