2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
DOI: 10.1109/iros.2015.7354296

Reinforcement learning vs human programming in tetherball robot games

Abstract: Reinforcement learning of motor skills is an important challenge for endowing robots with the ability to learn a wide range of skills and to solve complex tasks. However, comparing reinforcement learning against human programming is not straightforward. In this paper, we create a motor learning framework consisting of state-of-the-art components in motor skill learning and compare it to a manually designed program on the task of robot tetherball. We use dynamical motor primitives for representing the robot's…

Citations: cited by 13 publications (18 citation statements)
References: 10 publications
“…To this end, seminal work in robotics has proposed using dynamical systems to model actions and trajectories in a continuous space. Dynamic Movement Primitives (DMPs) [12,33,29] have been widely used to perform diverse, dynamic tasks such as table tennis [22], pancake flipping [16], or tetherball [25]. They are able to model smooth, natural motions, and have in fact been used to inspire many policy learning schemes [8,5,4,40,11,7].…”
Section: Related Work
confidence: 99%
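The DMP formulation referenced above is compact enough to sketch in a few lines. Below is a minimal, illustrative Python implementation of a single-DoF discrete DMP (a damped spring toward the goal plus a phase-gated forcing term); the function name, gain values, and basis-width heuristic are assumptions for illustration, not taken from the cited papers.

```python
import numpy as np

def dmp_rollout(w, y0, g, tau=1.0, dt=0.001, alpha_z=25.0, beta_z=6.25, alpha_x=8.0):
    """Integrate a single-DoF discrete DMP (Ijspeert-style transformation system).

    w      -- weights of the radial basis functions shaping the forcing term
    y0, g  -- start and goal positions
    Returns the position trajectory as a numpy array.
    """
    n_basis = len(w)
    # Basis centers spaced along the decaying phase variable x in (0, 1];
    # the widths below are a simple heuristic tied to the center spacing.
    c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))
    h = 1.0 / np.gradient(c) ** 2
    x, y, z = 1.0, y0, 0.0                 # phase, position, scaled velocity
    traj = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-h * (x - c) ** 2)
        # Forcing term vanishes as x -> 0, guaranteeing convergence to g.
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)
        z += dt * (alpha_z * (beta_z * (g - y) - z) + f) / tau
        y += dt * z / tau
        x += dt * (-alpha_x * x) / tau     # canonical system
        traj.append(y)
    return np.array(traj)

# With zero weights the DMP reduces to a critically damped spring to the goal,
# which is what makes the learned motions smooth and stable by construction.
trajectory = dmp_rollout(w=np.zeros(10), y0=0.0, g=1.0)
```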
“…Approaches maximizing such functions are commonly referred to as black-box optimizers. Such a formulation of the RL problem has been shown to be effective for learning or adjusting behavioral policies in robotic scenarios (Kupcsik et al., 2013; Parisi et al., 2015), especially when carefully designing the policy π(a|s, θ) to ensure safe behavior while using only a low-dimensional parameterization θ. One reason for the effectiveness is that the exploration of the algorithm is performed on the parameters θ instead of on the actions a.…”
Section: Application to Episodic Reinforcement Learning
confidence: 99%
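To make the contrast with action-space exploration concrete, here is a minimal, hypothetical sketch of episodic parameter-space search: whole parameter vectors θ are sampled from a search distribution, each is scored by one full rollout, and the distribution is refit to the best samples. This is a generic cross-entropy-method-style update standing in for the class of black-box optimizers described above, not the specific algorithm of the cited papers.

```python
import numpy as np

def episodic_search(rollout_return, dim, n_iters=50, pop=32, elite_frac=0.25):
    """Parameter-space exploration: perturb the policy parameters theta,
    not individual actions, and evaluate each sample with one episode."""
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(n_iters):
        thetas = mu + sigma * np.random.randn(pop, dim)   # sample candidates
        returns = np.array([rollout_return(t) for t in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]    # keep best rollouts
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

# Toy objective standing in for an episodic return R(theta).
best = episodic_search(lambda th: -np.sum((th - 0.5) ** 2), dim=5)
```

Because exploration noise enters only through θ, every rollout executes a smooth, consistent policy, which is what makes this formulation attractive for safe exploration on physical robots.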
“…If the policy parameterization is well-chosen for the task, this form of exploration can be much more effective. The contextual relative entropy policy search (C-REPS) algorithm (Neumann, 2011; Kupcsik et al., 2013; Parisi et al., 2015) frames the maximization of (15) over a task distribution µ(c) as a repeated entropy-regularized optimization

$$\max_{q(\theta,c)} \int\!\!\int q(\theta,c)\,R(\theta,c)\,d\theta\,dc \quad \text{s.t.} \quad \mathrm{KL}\!\left(q(\theta,c)\,\|\,p(\theta,c)\right)\le\epsilon, \qquad \int q(\theta,c)\,d\theta=\mu(c)$$

…”
Section: Application to Episodic Reinforcement Learning
confidence: 99%
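The practical effect of this entropy-regularized step is that each sampled θ is reweighted by an exponential of its return, with the temperature η obtained by solving a convex dual, and the search distribution is then refit by weighted maximum likelihood. The sketch below covers only the non-contextual special case (the contextual version additionally fits a baseline over c); the function name is illustrative and `scipy` is assumed to be available.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reps_weights(returns, epsilon=0.5):
    """Episodic REPS reweighting (non-contextual special case).

    Solves the convex dual g(eta) = eta*epsilon + eta*log(mean(exp(R/eta)))
    for the temperature eta, then returns sample weights proportional to
    exp(R/eta). Refitting the search distribution with these weights
    realizes the KL-constrained update."""
    R = returns - returns.max()          # shift returns for numerical stability
    def dual(eta):
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))
    eta = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded").x
    w = np.exp(R / eta)
    return w / w.sum()

# Weighted maximum-likelihood update of a Gaussian search distribution.
thetas = np.random.randn(100, 5)          # sampled policy parameters
returns = -np.sum(thetas ** 2, axis=1)    # toy episodic returns R(theta)
w = reps_weights(returns)
mu_new = w @ thetas                       # new mean of the search distribution
```

The KL bound ε controls how greedily the distribution moves toward high-return samples, which is what keeps successive policy updates close enough to remain safe on a real robot.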