2013
DOI: 10.1016/j.robot.2012.09.012
Compliant skills acquisition and multi-optima policy search with EM-based reinforcement learning

Abstract: The democratization of robotics technology and the development of new actuators progressively bring robots closer to humans. The applications that can now be envisaged contrast drastically with the requirements of industrial robots. In standard manufacturing settings, the criteria used to assess performance are usually related to the robot's accuracy, repeatability, speed or stiffness. Learning a control policy to actuate such robots is characterized by the search for a single solution for the task, with a re…


Cited by 50 publications (36 citation statements)
References 44 publications
“…It is applied to learning the "ball in a cup" game, using a DMP as a parameterized policy. A similar algorithm was used in [5] in a pancake flipping task, using a different parameterized policy consisting of a time-dependent mixture of proportional-derivative systems. PI² is a policy search method derived from the first principles of stochastic optimal control [14].…”
Section: Policy Search Methods for Learning Motions from Experience
confidence: 99%
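The EM-based policy search discussed here can be illustrated with a minimal reward-weighted averaging loop: perturb the policy parameters, evaluate each perturbation, and update toward the exploration samples weighted by their (transformed) cost. This is a sketch only; the quadratic `cost` stands in for a real rollout return, and the constants (population size, decay rate) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(theta):
    # Hypothetical quadratic stand-in for a rollout's task cost.
    return np.sum((theta - np.array([1.0, -0.5])) ** 2)

theta = np.zeros(2)              # policy parameters
sigma = 0.5                      # exploration noise, decayed manually

for _ in range(100):
    # Sample perturbed policies and evaluate one "rollout" each.
    eps = rng.normal(0.0, sigma, size=(20, 2))
    costs = np.array([cost(theta + e) for e in eps])
    # EM-style reward-weighted averaging: low cost -> high weight.
    w = np.exp(-(costs - costs.min()) / (costs.std() + 1e-9))
    theta = theta + (w[:, None] * eps).sum(0) / w.sum()
    sigma = max(0.05, 0.95 * sigma)

# theta drifts toward the low-cost region around [1.0, -0.5]
```

In practice the parameter vector would encode a DMP or a mixture of proportional-derivative systems rather than a raw point in task space.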
“…They have convenient properties, such as the possibility to allow non-constrained learning while ensuring qualitative behavior such as global stability at an attractor point or convergence to a limit cycle. There have been numerous successful demonstrations of learning from demonstration and reinforcement learning with such motion representations [18,19,5], learning motions for a large variety of tasks.…”
Section: Representing Motions for Control in Robotics
confidence: 99%
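The "global stability at an attractor point" property mentioned above is typically obtained from a spring-damper dynamical system, as in the transformation system of a DMP: whatever the learned modulation, the unperturbed dynamics converge to the goal. A minimal sketch, with illustrative gains and an explicit-Euler integration (not taken from the paper):

```python
import numpy as np

def attractor_step(x, v, goal, dt=0.01, k=25.0):
    # Critically damped spring-damper: globally stable at `goal`.
    d = 2.0 * np.sqrt(k)              # critical damping coefficient
    a = k * (goal - x) - d * v        # acceleration toward the attractor
    return x + dt * v, v + dt * a

x, v, goal = -3.0, 0.0, 1.0           # arbitrary start, goal at 1.0
for _ in range(2000):                 # integrate for 20 s
    x, v = attractor_step(x, v, goal)

# x settles at the attractor goal = 1.0 with v -> 0
```

Learned forcing terms are usually added on top of this baseline and vanish over time, so the stability guarantee of the underlying system is preserved.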
“…This number is initially chosen to be larger than the number of possible optima, and redundant options are deleted as the algorithm runs, with care taken to avoid premature deletion. Calinon et al. [15] use a Gaussian Mixture Model to estimate the structure of the cost function J, mapping the cost as a probability density function that depends on the policy parameters. Exploration noise is decayed manually, rather than automating it with covariance matrix adaptation.…”
Section: Problem Statement
confidence: 99%
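Mapping cost to a density over policy parameters can be sketched by turning sampled costs into importance weights and fitting a Gaussian mixture with cost-weighted EM, so that mixture components settle on the low-cost regions (the multiple optima). The bimodal `cost`, the component count, and all constants below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(theta):
    # Hypothetical bimodal task cost with optima near -2 and +2.
    return np.minimum((theta - 2.0) ** 2, (theta + 2.0) ** 2)

# Sample candidate policy parameters; turn cost into importance weights.
theta = rng.uniform(-5, 5, 500)
w = np.exp(-cost(theta))

# Cost-weighted EM for a 2-component 1-D Gaussian mixture.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: responsibilities under the current mixture.
    p = pi * np.exp(-(theta[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = p / p.sum(1, keepdims=True)
    # M-step: weight by cost-derived w, so mass concentrates on low-cost regions.
    wr = w[:, None] * r
    n = wr.sum(0)
    mu = (wr * theta[:, None]).sum(0) / n
    var = (wr * (theta[:, None] - mu) ** 2).sum(0) / n
    pi = n / n.sum()

# The component means recover the two optima, roughly -2 and +2
```

Deleting redundant options then amounts to pruning components whose mixing weight collapses or whose mean duplicates another component's.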
“…Exploration noise is decayed manually, rather than automating it with covariance matrix adaptation. The key difference between the algorithms in [7], [15] and SODIRS is that the former consider multiple skill options that are able to solve the same task, whereas SODIRS learns multiple options for multiple task variations, without requiring the user to specify the task parameter space.…”
Section: Problem Statement
confidence: 99%