2021 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra48506.2021.9561994
A Joint Network for Grasp Detection Conditioned on Natural Language Commands

Cited by 19 publications (11 citation statements)
References 29 publications
“…To address this problem, some recent studies have proposed to merge language grounding into vision-based manipulation and grasping pipelines [4]-[6], [9]-[13]. Conditioned on language, the robot can understand and execute a diverse range of VLG tasks.…”
Section: Related Work (mentioning)
confidence: 99%
“…11 templates are adapted from [28]. Similar to [13], we further augment the templates with QuillBot, an automatic paraphraser, to enrich vocabulary and grammatical diversity. There are two types of instructions: (1) task with a target object (e.g., "Use…”
Section: Dataset (mentioning)
confidence: 99%
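The excerpt above describes building language commands by filling instruction templates and then paraphrasing them for diversity. Below is a minimal sketch of the template-filling step, assuming hypothetical templates and object/task lists; the actual dataset templates and the QuillBot paraphrasing step are not reproduced here.

```python
# Hedged sketch of template-based instruction generation as described above:
# a few command templates are filled with object/task names. The templates,
# objects, and tasks are illustrative placeholders, not the dataset's actual
# templates, and no paraphrasing service is called here.
import random

TEMPLATES = [
    "Use the {object} to {task}.",        # task with a target object
    "Pick up the {object}.",              # target object only
    "Hand me the {object} so I can {task}.",
]

OBJECTS = ["hammer", "mug", "screwdriver"]
TASKS = ["drive the nail", "drink water", "tighten the screw"]


def generate_instruction(rng: random.Random) -> str:
    """Fill one randomly chosen template with a random object/task pair."""
    template = rng.choice(TEMPLATES)
    # Unused placeholders are simply ignored by str.format's keyword args.
    return template.format(object=rng.choice(OBJECTS), task=rng.choice(TASKS))


rng = random.Random(0)
for _ in range(3):
    print(generate_instruction(rng))
```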
“…Natural language provides a human-interactive interface to link humans to robots, which is important for deploying robots in our lives. Many studies [12]-[16] have explored how robots follow language instructions, in which robots are required to complete tasks specified by the language. Some studies [17]-[19] have learned language-conditioned behaviors through imitation learning.…”
Section: Related Work (mentioning)
confidence: 99%
“…• Benchmark Performance: In the simulation experiment, we evaluate model performance on collision-free grasping using the object-retrieval top-k recall (R@k) and top-k precision (P@k) metrics for multi-grasp detection (Hu et al., 2016). Chen et al. (2021b) propose the above metrics to evaluate language-based multi-grasping. We do not compare it with our work directly, because: (i) their work (including dataset) is not open-sourced.…”
Section: Settings (mentioning)
confidence: 99%
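The excerpt above evaluates language-conditioned multi-grasp detection with object-retrieval top-k recall (R@k) and top-k precision (P@k). Below is a minimal sketch of how such retrieval-style metrics are commonly computed; the per-command data layout (a ranked list of predicted object IDs and a set of ground-truth target IDs) and the function names are illustrative assumptions, not the evaluation code of the cited works.

```python
# Hedged sketch: top-k recall (R@k) and top-k precision (P@k) for
# language-conditioned grasp/object retrieval. The data layout is an
# assumption for illustration only.
from typing import List, Set


def recall_at_k(ranked_preds: List[int], targets: Set[int], k: int) -> float:
    """Fraction of ground-truth targets that appear in the top-k predictions."""
    if not targets:
        return 0.0
    top_k = set(ranked_preds[:k])
    return len(top_k & targets) / len(targets)


def precision_at_k(ranked_preds: List[int], targets: Set[int], k: int) -> float:
    """Fraction of the top-k predictions that are ground-truth targets."""
    if k == 0:
        return 0.0
    top_k = ranked_preds[:k]
    hits = sum(1 for p in top_k if p in targets)
    return hits / k


# Example: object IDs ranked by grounding/grasp confidence for one command.
ranked = [3, 7, 1, 5]   # hypothetical ranked predictions, best first
gt = {7, 2}             # hypothetical objects that satisfy the command
print(recall_at_k(ranked, gt, k=2))     # 0.5
print(precision_at_k(ranked, gt, k=2))  # 0.5
```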
“…It is useful for warehousing, manufacturing, medicine, retail, and service robots. One setting in robotic grasping is to grasp objects in order without disturbing the remaining objects in cluttered scenes (Chen et al., 2021b; Mees and Burgard, 2020; Zhang et al., 2021a) (called collision-free grasping). To solve this problem, a typical method first parses the input into a scene graph (Figure.…”
Section: Introduction (mentioning)
confidence: 99%
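The excerpt above mentions parsing the cluttered scene into a scene graph before choosing a collision-free grasp. Below is a minimal sketch of that idea under simplifying assumptions: objects are nodes, directed support ("resting on") edges are the only relation, and a grasp order is obtained by clearing whatever rests on the target first. The scene and relation set are hypothetical, not the parser of any cited work.

```python
# Hedged sketch of a support-relation scene graph for collision-free grasping.
# support_edges[x] holds the objects resting directly on x; an object is safe
# to grasp once everything stacked on it has been removed.
from collections import defaultdict
from typing import Dict, List, Set

support_edges: Dict[str, Set[str]] = defaultdict(set)
support_edges["book"].update({"mug"})          # mug rests on book
support_edges["table"].update({"book", "box"})  # book and box rest on table


def grasp_order(target: str, edges: Dict[str, Set[str]]) -> List[str]:
    """Return a removal order that clears everything stacked on `target`
    before grasping it (simple depth-first post-order over support edges)."""
    order: List[str] = []

    def visit(obj: str) -> None:
        for above in edges.get(obj, set()):
            visit(above)
        order.append(obj)

    visit(target)
    return order


print(grasp_order("book", support_edges))  # ['mug', 'book']
```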