StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects

Liu, Weiyu; Paxton, Chris; Hermans, Tucker; Fox, Dieter

doi:10.48550/arxiv.2110.10189

Cited by 5 publications

(9 citation statements)

References 30 publications


“…However, setting up a real-world TAMP system often requires substantial task-specific knowledge and accurate 3D models of the environment, significantly limiting the environments to which the system can generalize. To address this challenge, recent work has adopted deep learning-based approaches for robotic manipulation, for instance, on grasp planning [44,47,48,62,65], motion planning [7,57], and reasoning about spatial relations [20,36,49].…”

Section: Related Work (mentioning)

confidence: 99%

“…In contrast, we propose to use optical flow as the low-level feature descriptors, which can be naturally used to infer the full 6D transformations. In parallel to our work, recent efforts have also addressed rearrangement particularly learned from human demonstrations [14,72] and also with different goal specifications such as language [36,55].…”

Section: Related Work (mentioning)

confidence: 99%

“…With varying task setups, the desired goal state can be provided in different forms, for instance, a compact state representation [32,67] or natural language descriptions [36,55]. In this work, we address the rearrangement task where the goal state is specified by an RGB-D image [34,51], as shown in Fig.…”

Section: Introduction (mentioning)

confidence: 99%


IFOR: Iterative Flow Minimization for Robotic Object Rearrangement

Goyal, Mousavian, Paxton, et al. 2022

Preprint

Self Cite


Figure 1. An example of IFOR being applied to real data. The initial and goal scenes are shown on the left. Our approach allows the robot to repeatedly identify transformations that will minimize the flow for various objects between the current and goal scenes. It can then repeatedly grasp, move, and place objects, rotating as necessary, in order to achieve the configuration in the goal scene. The system is trained completely on synthetic data and transfers to the real world in zero-shot manner.




“…Language-Instructed Manipulation Recently, various manipulation tasks have been researched with language input either describing the entire task, or serving interactive input for task specifications. Structformer [23] proposes an object selection network from language and visual encodings, as well as a language conditioned pose generator for semantic object rearrangement. Stepputtis et al [24] proposed a closed-loop control model for pouring tasks.…”

Section: Related Work (mentioning)

confidence: 99%

VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Zheng, Chen, Jenkins, et al. 2022

Preprint


Benefiting from language flexibility and compositionality, humans naturally intend to use language to command an embodied agent for complex tasks such as navigation and object manipulation. In this work, we aim to fill the blank of the last mile of embodied agents: object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we introduce an Automatic Manipulation Solver (AMSolver) simulator and build a Vision-and-Language Manipulation benchmark (VLMbench) based on it, containing various language instructions on categorized robotic manipulation tasks. Specifically, modular rule-based task templates are created to automatically generate robot demonstrations with language instructions, consisting of diverse object shapes and appearances, action types, and motion constraints. We also develop a keypoint-based model 6D-CLIPort to deal with multi-view observations and language input and output a sequence of 6 degrees of freedom (DoF) actions. We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.


“…Natural language processing has recently received much attention in the field of robotics [8], following the advances made towards learning groundings between vision and language [9], [10], [11]. Recent successes in human-robot interaction include an interactive fetching system to localize objects mentioned in referring expressions [12], [13], [14], [15], [16] or grounding not only objects, but also spatial relations to follow language expressions characterizing pick-and-place commands [17], [18], [19]. By contrast, CALVIN tasks require grounding language to a wide variety of general-purpose robot skills.…”

Section: Related Work (mentioning)

confidence: 99%