2019 International Conference on Robotics and Automation (ICRA) 2019
DOI: 10.1109/icra.2019.8793485
Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

Abstract: Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. We use self-supervision to learn a compact and multimodal representation of …

Cited by 256 publications (139 citation statements). References 55 publications.
“…The Deterministic model is based on the model proposed in our previous work [39], which does not use a probabilistic graphical model framework. Instead we use deterministic encoders to learn the representation and deterministic decoders to predict the same self-supervised objectives.…”
Section: A Deterministic Model
confidence: 99%
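The snippet above describes a deterministic variant of the paper's architecture: per-modality encoders produce a compact fused representation, and deterministic decoder heads are trained on self-supervised targets. A minimal NumPy sketch of that data flow is below; all layer sizes, modality dimensions, and head names (e.g. `pose_head`, `contact_head`) are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Parameters of a simple dense layer (weights, bias); random init for the sketch.
    return rng.standard_normal((in_dim, out_dim)) * 0.1, np.zeros(out_dim)

def forward(x, layer):
    # Dense layer followed by a tanh nonlinearity.
    W, b = layer
    return np.tanh(x @ W + b)

# Hypothetical per-modality encoders: flattened image features,
# a window of force-torque readings, and proprioception.
img_enc = linear(128, 32)
force_enc = linear(32, 16)
proprio_enc = linear(8, 8)

# Fusion layer: concatenate the modality codes into one compact representation.
fusion = linear(32 + 16 + 8, 24)

# Deterministic decoder heads for self-supervised objectives
# (illustrative targets: next end-effector position, binary contact).
pose_head = linear(24, 3)
contact_head = linear(24, 1)

def encode(img, force, proprio):
    # Encode each modality, concatenate, then fuse into one representation.
    z = np.concatenate([forward(img, img_enc),
                        forward(force, force_enc),
                        forward(proprio, proprio_enc)])
    return forward(z, fusion)

# One forward pass on random inputs of the assumed shapes.
z = encode(rng.standard_normal(128),
           rng.standard_normal(32),
           rng.standard_normal(8))
next_pose = forward(z, pose_head)
contact_logit = z @ contact_head[0] + contact_head[1]
```

In training, the decoder-head errors would be backpropagated into the encoders so the fused representation `z` captures information shared across vision, touch, and proprioception; the sketch shows only the forward pass.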
“…In our previous work [39], we have used the same robot for real-world experiments. Here, we use the Franka Panda robot (also with 7-DoF, torque-controlled) to emphasize that the results reported in [39] are reproducible on different hardware. Four sensor modalities are available in both simulation and real hardware, including proprioception, an RGB-D camera, and a force-torque sensor.…”
Section: Experiments: Design and Setup
confidence: 99%