2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01441
Simple but Effective: CLIP Embeddings for Embodied AI

Cited by 89 publications (42 citation statements). References 15 publications.
“…Embodied agent research (Duan et al., 2022; Batra et al., 2020; Ravichandar et al., 2020; Collins et al., 2021) is adopting the large-scale pre-training paradigm, powered by a collection of learning environments (Abramson et al., 2020; Shridhar et al., 2020; Savva et al., 2019; Puig et al., 2018; Team et al., 2021; Toyama et al., 2021; Shi et al., 2017). From the aspect of pre-training for better representations, LaTTe (Bucker et al., 2022) and Embodied-CLIP (Khandelwal et al., 2021) leverage the frozen visual and textual representations of CLIP (Radford et al., 2021) for robotic manipulation.…”
Section: E2 Vary T5 Encoder Sizes
confidence: 99%
“…Gan et al. (2017) and Zhao et al. (2020) have suggested style-guided captioning, but also employ training over paired data. CLIP (2021) marked a turning point in vision-language perception, and has been utilized for vision-related tasks by various distillation techniques (Song et al., 2022; Jin et al., 2021; Gal et al., 2021; Khandelwal et al., 2022). Recent captioning methods use CLIP for reducing training time (Mokady et al., 2021), improved captions (Shen et al., 2021; Luo et al., 2022a,b; Cornia et al., 2021; Kuo and Kira, 2022), and in zero-shot settings (Su et al., 2022; Tewel et al., 2022).…”
Section: Related Work
confidence: 99%
“…Foundation Models in RL: Mu et al. [2022] use language to improve exploration via intrinsic rewards instead of using raw states; however, their method requires an oracle language annotator, which is not easily available for many RL environments. Khandelwal et al. [2022] investigate the effectiveness of CLIP visual representations directly for control on Embodied AI tasks [Batra et al., 2020] by bypassing the learning of policy visual representations and using CLIP embeddings instead. Their results demonstrated the effectiveness of CLIP representations for control on navigation-heavy Embodied AI tasks.…”
Section: Related Work
confidence: 99%
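As a rough illustration of the approach described in the statement above, the sketch below trains a small policy head on top of a frozen CLIP visual encoder. It is a minimal sketch, assuming OpenAI's clip package and PyTorch; the backbone choice, action count, and policy head shape are illustrative placeholders, not the architecture used by Khandelwal et al. [2022].

import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("RN50", device=device)

# Freeze the CLIP visual backbone so only the policy head receives gradients.
for p in clip_model.parameters():
    p.requires_grad = False

NUM_ACTIONS = 6        # hypothetical discrete action space for a navigation task
CLIP_EMBED_DIM = 1024  # embedding size of the CLIP RN50 image encoder

policy_head = nn.Sequential(
    nn.Linear(CLIP_EMBED_DIM, 512),
    nn.ReLU(),
    nn.Linear(512, NUM_ACTIONS),
).to(device)

def act(image):
    # `image` is a single observation already preprocessed with CLIP's transform.
    with torch.no_grad():
        features = clip_model.encode_image(image.unsqueeze(0).to(device)).float()
    return policy_head(features)  # action logits computed from frozen CLIP features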
“…However, the representations can understand and distinguish between geometrical shapes very well, which is enough for diversity-based semantic exploration. On the contrary, CLIP representations have already been shown to be very effective for a large range of tasks, from text-based video retrieval [Fang et al., 2021; Luo et al., 2021] and text-driven image manipulation [Patashnik et al., 2021] to embodied AI tasks [Khandelwal et al., 2022] based on realistic visual observations, for example, a kitchen scene containing a microwave.…”
Section: CLIP Analysis on MiniGrid
confidence: 99%
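To make the kind of matching referred to above concrete, here is a minimal sketch that scores an observation image against a few text prompts using frozen CLIP features. It assumes OpenAI's clip package, PyTorch, and Pillow; the prompts and the file name observation.png are hypothetical placeholders, not part of the cited analysis.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompts covering both simple shapes and a realistic scene.
prompts = ["a kitchen scene containing a microwave", "a red triangle", "a green square"]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("observation.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # Cosine similarity between the observation and each prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

for prompt, score in zip(prompts, similarity.tolist()):
    print(f"{score:.3f}  {prompt}")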