2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01441
Simple but Effective: CLIP Embeddings for Embodied AI

Cited by 89 publications (42 citation statements). References 15 publications.
“…Embodied agent research (Duan et al., 2022; Batra et al., 2020; Ravichandar et al., 2020; Collins et al., 2021) is adopting the large-scale pre-training paradigm, powered by a collection of learning environments (Abramson et al., 2020; Shridhar et al., 2020; Savva et al., 2019; Puig et al., 2018; Team et al., 2021; Toyama et al., 2021; Shi et al., 2017). From the aspect of pre-training for better representations, LaTTe (Bucker et al., 2022) and Embodied-CLIP (Khandelwal et al., 2021) leverage the frozen visual and textual representations of CLIP (Radford et al., 2021) for robotic manipulation.…”
Section: E2 Vary T5 Encoder Sizes
confidence: 99%
“…Gan et al. (2017) and Zhao et al. (2020) have suggested style-guided captioning, but also employ training over paired data. CLIP (2021) marked a turning point in vision-language perception, and has been utilized for vision-related tasks by various distillation techniques (Song et al., 2022; Jin et al., 2021; Gal et al., 2021; Khandelwal et al., 2022). Recent captioning methods use CLIP for reducing training time (Mokady et al., 2021), improved captions (Shen et al., 2021; Luo et al., 2022a,b; Cornia et al., 2021; Kuo and Kira, 2022), and in zero-shot settings (Su et al., 2022; Tewel et al., 2022).…”
Section: Related Work
confidence: 99%
“…Foundation Models in RL: Mu et al. [2022] use language to improve exploration via intrinsic rewards instead of using raw states; however, their method requires an oracle language annotator, which is not easily available for many RL environments. Khandelwal et al. [2022] investigate the effectiveness of CLIP visual representations directly for control on Embodied AI tasks [Batra et al., 2020] by bypassing the learning of policy visual representations and using CLIP embeddings instead. Their results demonstrated the effectiveness of CLIP representations for control on navigation-heavy Embodied AI tasks.…”
Section: Related Work
confidence: 99%
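As a rough illustration of the approach described in the statement above, the sketch below trains a small policy head on top of a frozen CLIP visual encoder. It is a minimal sketch, assuming OpenAI's clip package and PyTorch; the backbone choice, action count, and policy head shape are illustrative placeholders, not the architecture used by Khandelwal et al. [2022].

import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("RN50", device=device)

# Freeze the CLIP visual backbone so only the policy head receives gradients.
for p in clip_model.parameters():
    p.requires_grad = False

NUM_ACTIONS = 6        # hypothetical discrete action space for a navigation task
CLIP_EMBED_DIM = 1024  # embedding size of the CLIP RN50 image encoder

policy_head = nn.Sequential(
    nn.Linear(CLIP_EMBED_DIM, 512),
    nn.ReLU(),
    nn.Linear(512, NUM_ACTIONS),
).to(device)

def act(image):
    # `image` is a single observation already preprocessed with CLIP's transform.
    with torch.no_grad():
        features = clip_model.encode_image(image.unsqueeze(0).to(device)).float()
    return policy_head(features)  # action logits computed from frozen CLIP features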
“…However, the representations can understand and distinguish between geometrical shapes very well, which is enough for diversity-based semantic exploration. On the contrary, CLIP representations have already been shown to be very effective for a large range of tasks, from text-based video retrieval [Fang et al., 2021; Luo et al., 2021] and text-driven image manipulation [Patashnik et al., 2021] to embodied AI tasks [Khandelwal et al., 2022] based on realistic visual observations, for example, a kitchen scene containing a microwave.…”
Section: CLIP Analysis on MiniGrid
confidence: 99%
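To make the kind of matching referred to above concrete, here is a minimal sketch that scores an observation image against a few text prompts using frozen CLIP features. It assumes OpenAI's clip package, PyTorch, and Pillow; the prompts and the file name observation.png are hypothetical placeholders, not part of the cited analysis.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompts covering both simple shapes and a realistic scene.
prompts = ["a kitchen scene containing a microwave", "a red triangle", "a green square"]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("observation.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # Cosine similarity between the observation and each prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

for prompt, score in zip(prompts, similarity.tolist()):
    print(f"{score:.3f}  {prompt}")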