Learning Actions from Human Demonstration Video for Robotic Manipulation

Yang, Shuo; Zhang, Wei; Lu, Weizhi; Wang, Hesheng; Li, Yibin

doi:10.1109/iros40897.2019.8968278

Cited by 24 publications

(8 citation statements)

References 36 publications

“…Nguyen et al [3], [5] proposed to caption human actions into command sentences, which can be used to control robotic actions. Similar works can be observed to improve the capabilities of Vision-Language models under robotic settings for problems like Human-Robot Interaction [24]-[26], action learning and planning [1], [2], [4], [27], [28], etc. However, the evaluation of these methods usually involves: (1) sampling of a small fixed number of frames, which is not suitable when intermediate feedback is continuously requested in a real-time video stream; or (2) heavy reliance on object detection, which is only weakly associated to manipulation contexts.…”

Section: B Vision and Language In Robotics (mentioning)

confidence: 82%

“…Intelligent robots face challenges in: (1) interpreting sensor inputs of vision and force contact interactions through modeling and learning from daily life knowledge; and (2) performing intelligent actions that take into account the surrounding physical environment as well as human intention. Various studies [1], [2], [4], [6]-[13] have structured and planned manipulation actions and activities for robotics in ways similar to human thinking, however it is still challenging to extract contextual knowledge directly from daily life.…”

Section: Introduction (mentioning)

confidence: 99%

Understanding Contexts Inside Robot and Human Manipulation Tasks through a Vision-Language Model and Ontology System in a Video Stream

Jiang, Dehghan, Jägersand (2020, preprint)

Manipulation tasks in daily life, such as pouring water, unfold intentionally under specialized manipulation contexts. Being able to process contextual knowledge in these Activities of Daily Living (ADLs) over time can help us understand manipulation intentions, which are essential for an intelligent robot to transition smoothly between various manipulation actions. In this paper, to model the intended concepts of manipulation, we present a vision dataset under a strictly constrained knowledge domain for both robot and human manipulations, where manipulation concepts and relations are stored by an ontology system in a taxonomic manner. Furthermore, we propose a scheme to generate a combination of visual attentions and an evolving knowledge graph filled with commonsense knowledge. Our scheme works with real-world camera streams and fuses an attention-based Vision-Language model with the ontology system. The experimental results demonstrate that the proposed scheme can successfully represent the evolution of an intended object manipulation procedure for both robots and humans. The proposed scheme allows the robot to mimic human-like intentional behaviors by watching real-time videos. We aim to develop this scheme further for real-world robot intelligence in Human-Robot Interaction.

“…With the advent of deep learning (LeCun et al, 2015; Goodfellow et al, 2016), it was possible to learn visual features characterising the task directly from raw RGB videos. The features are extracted from raw videos using a variety of methods: deep metric learning (Sermanet et al, 2018), generative adversarial learning (Stadie et al, 2017), domain translation (Liu et al, 2018; Smith et al, 2019; Sharma et al, 2019), transfer learning (Sharma et al, 2018; Sermanet et al, 2017), action primitives (Jia et al, 2020), predictive modelling (Tow et al, 2017), video to text translation (Yang et al, 2019), and meta-learning (Yu et al, 2018a; Yu et al, 2018b). A comparison of these methods is given in the Table 1 and a detailed study can be found in (Pauly, 2021).…”

Section: Related Work (mentioning)

confidence: 99%

O2A: One-Shot Observational Learning with Action Vectors

et al. 2021

We present O2A, a novel method for learning to perform robotic manipulation tasks from a single (one-shot) third-person demonstration video. To our knowledge, it is the first time this has been done for a single demonstration. The key novelty lies in pre-training a feature extractor for creating a perceptual representation for actions that we call “action vectors”. The action vectors are extracted using a 3D-CNN model pre-trained as an action classifier on a generic action dataset. The distance between the action vectors from the observed third-person demonstration and trial robot executions is used as a reward for reinforcement learning of the demonstrated task. We report on experiments in simulation and on a real robot, with changes in viewpoint of observation, properties of the objects involved, scene background and morphology of the manipulator between the demonstration and the learning domains. O2A outperforms baseline approaches under different domain shifts and has comparable performance with an Oracle (that uses an ideal reward function). Videos of the results, including demonstrations, can be found on our project website.

“…Traditional rigid robots, e.g. dexterous hands (Mattar, 2013; Rebollo et al, 2017) and robotic arms (Yang et al, 2019; Golluccio et al, 2020), are generally composed of stiff materials and bend by actuating discrete joints. Soft robots, e.g.…”

Section: Introduction (mentioning)

confidence: 99%

Development of a novel robotic hand with soft materials and rigid structures

Cong, Liu, et al. 2021

Purpose: Rigid robotic hands are generally fast, precise and capable of exerting large forces, whereas soft robotic hands are compliant, safe and adaptive to complex environments. It is valuable and challenging to develop soft-rigid robotic hands that have both types of capabilities. The paper aims to address the challenge through developing a paradigm to achieve the behaviors of soft and rigid robotic hands adaptively.

Design/methodology/approach: The design principle of a two-joint finger is proposed. A kinematic model and a stiffness enhancement method are proposed and discussed. The manufacturing process for the soft-rigid finger is presented. Experiments are carried out to validate the accuracy of the kinematic model and evaluate the performance of the flexible body of the finger. Finally, a robotic hand composed of two soft-rigid fingers is fabricated to demonstrate its grasping capacities.

Findings: The kinematic model can capture the desired distal deflection and comprehensive shape accurately. The stiffness enhancement method guarantees stable grasp of the robotic hand, without sacrificing its flexibility and adaptability. The robotic hand is lightweight and practical. It can exhibit different grasping capacities.

Practical implications: It can be applied in the field of industrial grasping, where the objects are varied in materials and geometry. The hand’s inherent characteristic removes the need to detect and react to slight variations in surface geometry and makes the control strategies simple.

Originality/value: This work proposes a novel robotic hand. It possesses three distinct characteristics, i.e. high compliance, exhibiting discrete or continuous kinematics adaptively, lightweight and practical structures.
