2022
DOI: 10.48550/arxiv.2203.13411
Preprint

Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers

Abstract: Natural language is the most intuitive medium for us to interact with other people when expressing commands and instructions. However, using language is seldom an easy task when humans need to express their intent towards robots, since most of the current language interfaces require rigid templates with a static set of action targets and commands. In this work, we provide a flexible language-based interface for human-robot collaboration, which allows a user to reshape existing trajectories for an autonomous agent…
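As a rough illustration of the kind of interface the abstract describes, the sketch below conditions an existing trajectory on a language-feature vector with a small transformer encoder and decodes reshaped waypoints. All module names, dimensions, and the fusion scheme (prepending the language features as an extra token) are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class TrajectoryReshaper(nn.Module):
    """Toy model: reshape (x, y, z) waypoints conditioned on a language-feature vector."""
    def __init__(self, lang_dim=768, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.waypoint_in = nn.Linear(3, d_model)      # each (x, y, z) waypoint -> one token
        self.lang_in = nn.Linear(lang_dim, d_model)   # language features -> one extra token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.waypoint_out = nn.Linear(d_model, 3)     # token -> reshaped waypoint

    def forward(self, waypoints, lang_features):
        # waypoints: (B, T, 3); lang_features: (B, lang_dim), e.g. a BERT-sized vector (assumed size)
        tokens = torch.cat(
            [self.lang_in(lang_features).unsqueeze(1), self.waypoint_in(waypoints)], dim=1
        )
        encoded = self.encoder(tokens)
        return self.waypoint_out(encoded[:, 1:, :])   # drop the language token, keep T waypoints

model = TrajectoryReshaper()
reshaped = model(torch.randn(1, 50, 3), torch.randn(1, 768))
print(reshaped.shape)  # torch.Size([1, 50, 3])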

Cited by 4 publications (6 citation statements)
References 30 publications

“…By computing the cosine similarity vector s between the embeddings, we identify a possible target object that the user is referring to. In Section 3.1, we show that using the object's images or the object's names (as done in [29]) brings equivalent results, since CLIP maps both images and text to a joint latent space. Finally, we concatenate the similarity vector s and the semantic features q_BERT(z_in | L_in), forming what we call the features embedding q_F.…”
Section: Figure 2: Synthetic Dataset Examples and Model Predictions
confidence: 95%
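To make the step described in this excerpt concrete, here is a minimal sketch of forming the similarity vector s and the features embedding q_F from pre-computed embeddings. It assumes the CLIP and BERT embeddings are already available as vectors; the dimensions, variable names, and helper function are illustrative, not taken from the paper's code.

import numpy as np

def cosine_similarity_vector(cmd_emb, obj_embs):
    """Cosine similarity between one command embedding and N object embeddings."""
    cmd = cmd_emb / np.linalg.norm(cmd_emb)
    objs = obj_embs / np.linalg.norm(obj_embs, axis=1, keepdims=True)
    return objs @ cmd  # shape (N,)

rng = np.random.default_rng(0)
clip_dim, bert_dim, num_objects = 512, 768, 5   # assumed sizes

cmd_clip = rng.normal(size=clip_dim)                 # stand-in for the CLIP text embedding of the command
obj_clip = rng.normal(size=(num_objects, clip_dim))  # stand-ins for CLIP embeddings of object images (or names)
q_bert = rng.normal(size=bert_dim)                   # stand-in for the BERT semantic features q_BERT(z_in | L_in)

s = cosine_similarity_vector(cmd_clip, obj_clip)     # similarity vector s over candidate objects
q_F = np.concatenate([s, q_bert])                    # features embedding q_F
print(s.shape, q_F.shape)                            # (5,) (773,)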
“…Particular highlight should also be given to the works [28,29], which are more closely related to our approach. In [28], a method of mapping NL to transformations of cost functions is proposed.…”
Section: Related Work
confidence: 99%
“…visual relationships [19]. Language is also used as an additional input to guide tasks such as video summarization [22]. In robotics or policy learning, agents not only follow instructions, but also learn to update semantic maps for robot manipulation [24], reshape trajectories [5], and acquire new skills from language inputs [32].…”
Section: Related Work: 3D Human and Object Reconstruction
confidence: 99%
“…Transformers in robotics: Transformers were originally introduced in the language-processing domain [120], but quickly proved useful for modeling long-range data dependencies in other domains. Within robotics, we see the first transformer architectures being used for trajectory forecasting [121], motion planning [122,123], and reinforcement learning [124,125]. The main difference between these works and GRID is that they focus on training a model for a single task, while we propose learning representations amenable to multiple downstream tasks for a robot.…”
Section: Related Work
confidence: 99%