2022
DOI: 10.48550/arxiv.2203.13411
Preprint

Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers

Abstract: Natural language is the most intuitive medium for us to interact with other people when expressing commands and instructions. However, using language is seldom an easy task when humans need to express their intent towards robots, since most of the current language interfaces require rigid templates with a static set of action targets and commands. In this work, we provide a flexible language-based interface for human-robot collaboration, which allows a user to reshape existing trajectories for an autonomous agent…
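As a rough illustration of the kind of interface the abstract describes, the sketch below conditions an existing trajectory on a language-feature vector with a small transformer encoder and decodes reshaped waypoints. All module names, dimensions, and the fusion scheme (prepending the language features as an extra token) are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class TrajectoryReshaper(nn.Module):
    """Toy model: reshape (x, y, z) waypoints conditioned on a language-feature vector."""
    def __init__(self, lang_dim=768, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.waypoint_in = nn.Linear(3, d_model)      # each (x, y, z) waypoint -> one token
        self.lang_in = nn.Linear(lang_dim, d_model)   # language features -> one extra token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.waypoint_out = nn.Linear(d_model, 3)     # token -> reshaped waypoint

    def forward(self, waypoints, lang_features):
        # waypoints: (B, T, 3); lang_features: (B, lang_dim), e.g. a BERT-sized vector (assumed size)
        tokens = torch.cat(
            [self.lang_in(lang_features).unsqueeze(1), self.waypoint_in(waypoints)], dim=1
        )
        encoded = self.encoder(tokens)
        return self.waypoint_out(encoded[:, 1:, :])   # drop the language token, keep T waypoints

model = TrajectoryReshaper()
reshaped = model(torch.randn(1, 50, 3), torch.randn(1, 768))
print(reshaped.shape)  # torch.Size([1, 50, 3])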

Cited by 4 publications (6 citation statements)
References 30 publications

“…By computing the cosine similarity vector s between the embeddings, we identify a possible target object that the user is referring to. In Section 3.1, we show that using the object's images or the object's names (as done in [29]) brings equivalent results, since CLIP maps both images and text to a joint latent space. Finally, we concatenate the similarity vector s and the semantic features q_BERT(z_in | L_in), forming what we call the features embedding q_F.…”
Section: Figure 2: Synthetic Dataset Examples and Model Predictions
confidence: 95%
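To make the step described in this excerpt concrete, here is a minimal sketch of forming the similarity vector s and the features embedding q_F from pre-computed embeddings. It assumes the CLIP and BERT embeddings are already available as vectors; the dimensions, variable names, and helper function are illustrative, not taken from the paper's code.

import numpy as np

def cosine_similarity_vector(cmd_emb, obj_embs):
    """Cosine similarity between one command embedding and N object embeddings."""
    cmd = cmd_emb / np.linalg.norm(cmd_emb)
    objs = obj_embs / np.linalg.norm(obj_embs, axis=1, keepdims=True)
    return objs @ cmd  # shape (N,)

rng = np.random.default_rng(0)
clip_dim, bert_dim, num_objects = 512, 768, 5   # assumed sizes

cmd_clip = rng.normal(size=clip_dim)                 # stand-in for the CLIP text embedding of the command
obj_clip = rng.normal(size=(num_objects, clip_dim))  # stand-ins for CLIP embeddings of object images (or names)
q_bert = rng.normal(size=bert_dim)                   # stand-in for the BERT semantic features q_BERT(z_in | L_in)

s = cosine_similarity_vector(cmd_clip, obj_clip)     # similarity vector s over candidate objects
q_F = np.concatenate([s, q_bert])                    # features embedding q_F
print(s.shape, q_F.shape)                            # (5,) (773,)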
“…Particular highlight should also be given to the works [28,29], which are more closely related to our approach. In [28], a method of mapping NL to transformations of cost functions is proposed.…”
Section: Related Work
confidence: 99%
“…visual relationships [19]. Language is also used as an additional input to guide tasks such as video summarization [22]. In robotics or policy learning, agents not only follow instructions, but also learn to update semantic maps for robot manipulation [24], reshape trajectories [5], and acquire new skills from language inputs [32].…”
Section: Related Work: 3D Human and Object Reconstruction
confidence: 99%
“…Transformers in robotics: Transformers were originally introduced in the language-processing domain [120], but quickly proved useful for modeling long-range data dependencies in other domains. Within robotics, we see the first transformer architectures being used for trajectory forecasting [121], motion planning [122,123], and reinforcement learning [124,125]. The main difference between these works and GRID is that they focus on training a model for a single task, while we propose learning representations amenable to multiple downstream tasks for a robot.…”
Section: Related Work
confidence: 99%