2017 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra.2017.7989325
Deep multimodal embedding: Manipulating novel objects with point-clouds, language and trajectories

Abstract: A robot operating in a real-world environment needs to perform reasoning over a variety of sensor modalities such as vision, language and motion trajectories. However, it is extremely challenging to manually design features relating such disparate modalities. In this work, we introduce an algorithm that learns to embed point-cloud, natural language, and manipulation trajectory data into a shared embedding space with a deep neural network. To learn semantically meaningful spaces throughout our network, we use a…
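
The abstract describes projecting point-cloud, language, and trajectory data into one shared embedding space with a deep network. Below is a minimal sketch of that general idea, not the authors' published architecture or loss: three small encoders map each modality into a common space and are trained with a simple margin-based ranking loss so that matching pairs score higher than mismatched ones. All layer sizes, input feature dimensions, and the margin value are illustrative assumptions.

    # Minimal multimodal-embedding sketch (illustrative only, not the paper's model).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def encoder(in_dim, embed_dim=64):
        """Two-layer MLP projecting one modality into the shared space."""
        return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    # Hypothetical per-modality feature sizes.
    point_cloud_enc = encoder(in_dim=512)   # e.g. pooled point-cloud features
    language_enc    = encoder(in_dim=300)   # e.g. averaged word embeddings
    trajectory_enc  = encoder(in_dim=100)   # e.g. fixed-length waypoint vector

    def embed(enc, x):
        # L2-normalise so dot products behave like cosine similarity.
        return F.normalize(enc(x), dim=-1)

    def margin_loss(anchor, positive, negative, margin=0.2):
        """Hinge loss: a matching pair should score higher than a mismatched one."""
        pos = (anchor * positive).sum(dim=-1)
        neg = (anchor * negative).sum(dim=-1)
        return F.relu(margin - pos + neg).mean()

    # Toy batch of 8 (point-cloud, language, trajectory) triples; shuffled
    # trajectories serve as negatives.
    pc, lang, traj = torch.randn(8, 512), torch.randn(8, 300), torch.randn(8, 100)
    neg_traj = traj[torch.randperm(8)]

    z_pc   = embed(point_cloud_enc, pc)
    z_lang = embed(language_enc, lang)
    z_traj = embed(trajectory_enc, traj)
    z_neg  = embed(trajectory_enc, neg_traj)

    loss = margin_loss(z_pc, z_traj, z_neg) + margin_loss(z_lang, z_traj, z_neg)
    loss.backward()

In practice the shared space is what makes retrieval across modalities possible: a language instruction can be embedded and matched against candidate trajectories by nearest-neighbour search in the joint space.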

Cited by 26 publications (17 citation statements), published 2017–2023 | References 24 publications
“…As mentioned in section Fusion, the fusion of learning modes adopts the most appropriate learning modes at each part of the method architecture to improve model performance and reduce learning costs. In contrast with multi-mode fusion, multimodal fusion does not rely only on the point cloud itself to improve grasp learning; it also utilizes language or tactile sensing to enrich the features extracted in the learning process so that the robot has more grasping knowledge (Sung et al, 2017; Zhou Y. et al, 2018; Abi-Farraj et al, 2019; Kumar et al, 2019; Ottenhaus et al, 2019; Wang T. et al, 2019; Watkins-Valls et al, 2019).…”
Section: Challenges and Future Directions (mentioning)
confidence: 99%
“…Another consideration for our remote, novice programming proposal is whether this work can be crowdsourced effectively. The success of crowdsourcing in other tasks has been shown extensively in prior work, including for spoken dialog generation for conversational systems (Lasecki et al, 2013; Mitchell et al, 2014; Leite et al, 2016; Yu et al, 2016; Guo et al, 2017; Kennedy et al, 2017; Huang et al, 2018; Jonell et al, 2019) and for interaction data and non-verbal behavior (Orkin and Roy, 2007; Orkin and Roy, 2009; Chernova et al, 2010; Rossen and Lok, 2012; Breazeal et al, 2013; Sung et al, 2016). In previous research that is more relevant to ours, Lee and Ko (2011) crowdsourced non-expert programmers for an online study and found that personified feedback of a robot blaming itself for errors increased the non-programmers’ motivation to program.…”
Section: Introduction (mentioning)
confidence: 95%
“…Prior work has used crowdsourcing for spoken dialog generation for conversational systems (Jurčíček et al, 2011; Lasecki et al, 2013; Mitchell et al, 2014; Leite et al, 2016; Yu et al, 2016; Guo et al, 2017; Kennedy et al, 2017; Huang et al, 2018; Jonell et al, 2019) and for interaction data and non-verbal behavior (Orkin and Roy, 2007; Orkin and Roy, 2009; Chernova et al, 2010; Rossen and Lok, 2012; Breazeal et al, 2013; Sung et al, 2016). However, no previous work created a method to collect new robot behaviors for day-to-day tasks on a large scale using semi-situated non-experts.…”
Section: Related Work (mentioning)
confidence: 99%
“…Mapping different modalities of data into the same latent space has been studied before. It has been shown that images and text can be projected into a single space using neural networks (Wang, Li, and Lazebnik 2016; Zhang et al 2017), as can the combination of point-cloud, text, and robot manipulation trajectories (Sung, Lenz, and Saxena 2017), and the combination of text, location, and time (Zhang et al 2017). However, instead of aiming at embedding data of different modalities into one space, the question we are interested in is how to obtain the embedding of a “container” (neighborhood) by integrating the multiple modalities of data inside the “container”.…”
Section: Related Work (mentioning)
confidence: 99%