2017 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra.2017.7989325
Deep multimodal embedding: Manipulating novel objects with point-clouds, language and trajectories

Abstract: A robot operating in a real-world environment needs to perform reasoning over a variety of sensor modalities such as vision, language and motion trajectories. However, it is extremely challenging to manually design features relating such disparate modalities. In this work, we introduce an algorithm that learns to embed point-cloud, natural language, and manipulation trajectory data into a shared embedding space with a deep neural network. To learn semantically meaningful spaces throughout our network, we use a…
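
The abstract describes projecting point-cloud, language, and trajectory data into one shared embedding space with a deep network. Below is a minimal sketch of that general idea, not the authors' published architecture or loss: three small encoders map each modality into a common space and are trained with a simple margin-based ranking loss so that matching pairs score higher than mismatched ones. All layer sizes, input feature dimensions, and the margin value are illustrative assumptions.

    # Minimal multimodal-embedding sketch (illustrative only, not the paper's model).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def encoder(in_dim, embed_dim=64):
        """Two-layer MLP projecting one modality into the shared space."""
        return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    # Hypothetical per-modality feature sizes.
    point_cloud_enc = encoder(in_dim=512)   # e.g. pooled point-cloud features
    language_enc    = encoder(in_dim=300)   # e.g. averaged word embeddings
    trajectory_enc  = encoder(in_dim=100)   # e.g. fixed-length waypoint vector

    def embed(enc, x):
        # L2-normalise so dot products behave like cosine similarity.
        return F.normalize(enc(x), dim=-1)

    def margin_loss(anchor, positive, negative, margin=0.2):
        """Hinge loss: a matching pair should score higher than a mismatched one."""
        pos = (anchor * positive).sum(dim=-1)
        neg = (anchor * negative).sum(dim=-1)
        return F.relu(margin - pos + neg).mean()

    # Toy batch of 8 (point-cloud, language, trajectory) triples; shuffled
    # trajectories serve as negatives.
    pc, lang, traj = torch.randn(8, 512), torch.randn(8, 300), torch.randn(8, 100)
    neg_traj = traj[torch.randperm(8)]

    z_pc   = embed(point_cloud_enc, pc)
    z_lang = embed(language_enc, lang)
    z_traj = embed(trajectory_enc, traj)
    z_neg  = embed(trajectory_enc, neg_traj)

    loss = margin_loss(z_pc, z_traj, z_neg) + margin_loss(z_lang, z_traj, z_neg)
    loss.backward()

In practice the shared space is what makes retrieval across modalities possible: a language instruction can be embedded and matched against candidate trajectories by nearest-neighbour search in the joint space.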

Cited by 26 publications (17 citation statements), published 2017–2023 | References 24 publications
“…As mentioned in section Fusion, the fusion of learning modes adopts the most appropriate learning modes at each part of the method architecture to improve model performance and reduce learning costs. In contrast with multi-mode fusion, multimodal fusion does not rely only on the point cloud itself to improve grasp learning; it also utilizes language or tactile sensing to enrich the features extracted in the learning process so that the robot has more grasping knowledge (Sung et al, 2017; Zhou Y. et al, 2018; Abi-Farraj et al, 2019; Kumar et al, 2019; Ottenhaus et al, 2019; Wang T. et al, 2019; Watkins-Valls et al, 2019).…”
Section: Challenges and Future Directions (mentioning)
confidence: 99%
“…Another consideration for our remote, novice programming proposal is whether this work can be crowdsourced effectively. The success of crowdsourcing in other tasks has been shown extensively in prior work, including for spoken dialog generation for conversational systems (Lasecki et al, 2013; Mitchell et al, 2014; Leite et al, 2016; Yu et al, 2016; Guo et al, 2017; Kennedy et al, 2017; Huang et al, 2018; Jonell et al, 2019) and for interaction data and non-verbal behavior (Orkin and Roy, 2007; Orkin and Roy, 2009; Chernova et al, 2010; Rossen and Lok, 2012; Breazeal et al, 2013; Sung et al, 2016). In previous research that is more relevant to ours, Lee and Ko (2011) crowdsourced non-expert programmers for an online study and found that personified feedback of a robot blaming itself for errors increased the non-programmers’ motivation to program.…”
Section: Introduction (mentioning)
confidence: 95%
“…Prior work has used crowdsourcing for spoken dialog generation for conversational systems (Jurčíček et al, 2011; Lasecki et al, 2013; Mitchell et al, 2014; Leite et al, 2016; Yu et al, 2016; Guo et al, 2017; Kennedy et al, 2017; Huang et al, 2018; Jonell et al, 2019) and for interaction data and non-verbal behavior (Orkin and Roy, 2007; Orkin and Roy, 2009; Chernova et al, 2010; Rossen and Lok, 2012; Breazeal et al, 2013; Sung et al, 2016). However, no previous work created a method to collect new robot behaviors for day-to-day tasks on a large scale using semi-situated non-experts.…”
Section: Related Work (mentioning)
confidence: 99%
“…Mapping different modalities of data into the same latent space has been studied before. It has been shown that images and text can be projected into a single space using neural networks (Wang, Li, and Lazebnik 2016; Zhang et al 2017), as can the combination of point-cloud, text, and robot manipulation trajectories (Sung, Lenz, and Saxena 2017), and the combination of text, location, and time (Zhang et al 2017). However, instead of aiming at embedding data of different modalities into one space, the question we are interested in is how to obtain the embedding of a “container” (neighborhood) by integrating the multiple modalities of data inside the “container”.…”
Section: Related Work (mentioning)
confidence: 99%