Learning to Score Behaviors for Guided Policy Optimization
2019 · Preprint · DOI: 10.48550/arxiv.1906.04349

Cited by 4 publications (15 citation statements) · References 0 publications
“…This notion of behavior, with slight modifications, has appeared in several papers in the Reinforcement Learning literature [23][24][25][26]. At least one existing work uses this notion of behavior in Novelty Search [23].…”
Section: Primitive Behavior (mentioning)
confidence: 99%
“…Another [24] uses it for optimization with an algorithm other than Novelty Search. [23,25,26] weight the constituent distances (i.e., w_s is not constant), and [25] uses primitive behavior to study the relationship between behavior and reward.…”
Section: Primitive Behavior (mentioning)
confidence: 99%
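As a rough illustration of the weighted variant mentioned in the excerpt above, the sketch below treats a policy's primitive behavior as the sequence of states it visits and compares two behaviors with per-timestep weights w_s. The function name and the weighting scheme are hypothetical, not the exact formulation used in any of the cited works.

```python
import numpy as np

def primitive_behavior_distance(states_a, states_b, weights=None):
    """Weighted distance between two behaviors, each given as a (T, d)
    array of visited states; `weights` plays the role of the per-timestep
    w_s (uniform if None). Hypothetical sketch, not the cited papers' exact
    formulation."""
    states_a = np.asarray(states_a, dtype=float)
    states_b = np.asarray(states_b, dtype=float)
    T = states_a.shape[0]
    if weights is None:
        weights = np.full(T, 1.0 / T)                        # constant w_s
    per_step = np.linalg.norm(states_a - states_b, axis=1)   # per-timestep distances
    return float(np.dot(weights, per_step))

# toy usage: two 5-step rollouts in a 2-D state space, weighting late steps more
rng = np.random.default_rng(0)
rollout_a = rng.normal(size=(5, 2))
rollout_b = rng.normal(size=(5, 2))
print(primitive_behavior_distance(rollout_a, rollout_b,
                                  weights=np.array([0.1, 0.1, 0.2, 0.3, 0.3])))
```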
“…For example, Proximal Policy Optimization (PPO) [17] penalizes the KL divergence between the old and the new policies, and the resulting objective can be efficiently solved by a first-order method such as gradient descent. Similarly, Behavior Guided Policy Gradient (BGPG) [18] considers the entropy-regularized Wasserstein distance between the old and the new policies, and penalizes this distance to prevent large policy updates.…”
Section: Introduction (mentioning)
confidence: 99%
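A minimal sketch of the penalized surrogate objective described in the excerpt above, assuming access to per-sample advantages and the old/new log-probabilities of the sampled actions. The KL term is a crude sample estimate; swapping it for a Wasserstein term between behavioral embeddings would give a BGPG-style penalty. Function and variable names are illustrative, not taken from either paper.

```python
import numpy as np

def penalized_surrogate(new_logp, old_logp, advantages, beta=1.0):
    """PPO-style penalized objective: importance-weighted advantage minus
    beta times a sample-based estimate of KL(old || new). Illustrative
    sketch only."""
    ratio = np.exp(new_logp - old_logp)          # pi_new / pi_old on sampled actions
    surrogate = np.mean(ratio * advantages)      # policy-gradient surrogate term
    kl_estimate = np.mean(old_logp - new_logp)   # samples come from the old policy
    return surrogate - beta * kl_estimate

# toy usage with random log-probabilities and advantages
rng = np.random.default_rng(0)
old_logp = rng.normal(-1.0, 0.1, size=128)
new_logp = old_logp + rng.normal(0.0, 0.05, size=128)
adv = rng.normal(size=128)
print(penalized_surrogate(new_logp, old_logp, adv, beta=0.5))
```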
“…While those methods achieve impressive performance, and the choice of the KL is well-motivated, one can still ask whether it is possible to include information about the behavior of policies when measuring similarity, and whether this could lead to more efficient algorithms. Pacchiano et al. (2019) provide a first insight into this question, representing policies using behavioral distributions that incorporate information about the outcome of the policies in the environment. The Wasserstein Distance (WD) (Villani, 2016) between those behavioral distributions is then used as a similarity measure between their corresponding policies.…”
Section: Introduction (mentioning)
confidence: 99%
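The sketch below shows one standard way to approximate an entropy-regularized Wasserstein distance between two empirical behavioral distributions using Sinkhorn iterations. It is a generic implementation under assumed uniform sample weights and a squared-Euclidean cost, not the specific estimator used by Pacchiano et al. (2019).

```python
import numpy as np

def sinkhorn_distance(x, y, epsilon=0.1, n_iters=200):
    """Entropy-regularized Wasserstein distance between two empirical
    distributions given by samples x (n, d) and y (m, d) with uniform
    weights. Generic Sinkhorn sketch, not the paper's exact estimator."""
    n, m = len(x), len(y)
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # pairwise costs
    K = np.exp(-cost / epsilon)                                   # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                                      # Sinkhorn updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    transport = np.outer(u, v) * K                                # approximate plan
    return float(np.sum(transport * cost))

# toy usage: behavioral embeddings sampled from an "old" and a "new" policy
rng = np.random.default_rng(1)
old_embeddings = rng.normal(0.0, 1.0, size=(64, 4))
new_embeddings = rng.normal(0.2, 1.0, size=(64, 4))
print(sinkhorn_distance(old_embeddings, new_embeddings))
```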
“…Behavior-Guided Policy Optimization. Motivated by the idea that policies can differ substantially as measured by their KL divergence yet still behave similarly in the environment, Pacchiano et al. (2019) recently proposed using a notion of behavioral proximity between policies for policy optimization. Exploiting similarity in behavior during optimization makes it possible to take larger steps in directions where policies behave similarly despite having a large KL divergence.…”
Section: Introduction (mentioning)
confidence: 99%
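To make this motivating observation concrete, the toy example below (entirely hypothetical, not from the paper) builds two tabular policies that disagree only on a state the dynamics never reach: their average per-state KL divergence is large, yet rollouts visit exactly the same states. This is the situation where a behavioral metric would permit a larger update step than a KL-based trust region.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (no zero entries)."""
    return float(np.sum(p * np.log(p / q)))

def rollout(policy, steps=12, seed=0):
    """Sample a trajectory in a toy 3-state chain where state 2 is unreachable:
    action 1 moves state 0 to state 1, everything else stays put."""
    rng = np.random.default_rng(seed)
    s, visited = 0, []
    for _ in range(steps):
        visited.append(s)
        a = rng.choice(2, p=policy[s])
        s = 1 if (s == 0 and a == 1) else s
    return visited

# two tabular policies that agree on the reachable states 0 and 1,
# but disagree strongly on the unreachable state 2
pi_old = np.array([[0.1, 0.9], [0.9, 0.1], [0.99, 0.01]])
pi_new = np.array([[0.1, 0.9], [0.9, 0.1], [0.01, 0.99]])

print("mean per-state KL :", np.mean([kl(pi_old[s], pi_new[s]) for s in range(3)]))
print("old policy rollout:", rollout(pi_old, seed=0))   # identical visited states
print("new policy rollout:", rollout(pi_new, seed=0))
```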