2020 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra40945.2020.9197331
Learning to See before Learning to Act: Visual Pre-training for Manipulation

Abstract: Does having visual priors (e.g. the ability to detect objects) facilitate learning to perform vision-based manipulation (e.g. picking up objects)? We study this problem under the framework of transfer learning, where the model is first trained on a passive vision task (i.e., the data distribution does not depend on the agent's decisions), then adapted to perform an active manipulation task (i.e., the data distribution does depend on the agent's decisions). We find that pre-training on vision tasks significantl…

Cited by 58 publications (36 citation statements)
References 43 publications
“…VGN [4] predicts 6-DoF grasps in clutter with a one-stage pipeline from input depth images. There is also a line of works that estimate affordance of an object or a scene first and then detect grasps based on estimated affordance [42,24,54]. In most of the prior works, deep networks are trained end-to-end with only grasp supervision.…”
Section: A. Learning Grasp Detection (mentioning)
Confidence: 99%
“…In contrast, Yen-Chen et al. [36] showed that pre-training on semantic tasks like classification and segmentation helps improve the efficiency and generalization of grasping predictions.…”
Section: Related Work (mentioning)
Confidence: 99%
“…Regarding learning affordances for grasping, the majority of previous works use ground-truth affordance labels to learn affordances for grasping ([18], [33], [34], [29]). [18] use thermal maps to learn the graspable positions of several household objects.…”
Section: Related Work, A. Affordance Learning for Grasping (mentioning)
Confidence: 99%