Learning to Score Behaviors for Guided Policy Optimization
2019 · Preprint · DOI: 10.48550/arxiv.1906.04349

Cited by 4 publications (15 citation statements) · References 0 publications
“…This notion of behavior, with slight modifications, has appeared in several papers in the Reinforcement Learning literature [23][24][25][26]. At least one existing work uses this notion of behavior in Novelty Search [23].…”
Section: Primitive Behavior (mentioning)
confidence: 99%
“…Another [24] uses it for optimization with an algorithm other than Novelty Search. [23,25,26] weight the constituent distances (i.e., w_s is not constant), and [25] uses primitive behavior to study the relationship between behavior and reward.…”
Section: Primitive Behavior (mentioning)
confidence: 99%
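As a rough illustration of the weighted variant mentioned in the excerpt above, the sketch below treats a policy's primitive behavior as the sequence of states it visits and compares two behaviors with per-timestep weights w_s. The function name and the weighting scheme are hypothetical, not the exact formulation used in any of the cited works.

```python
import numpy as np

def primitive_behavior_distance(states_a, states_b, weights=None):
    """Weighted distance between two behaviors, each given as a (T, d)
    array of visited states; `weights` plays the role of the per-timestep
    w_s (uniform if None). Hypothetical sketch, not the cited papers' exact
    formulation."""
    states_a = np.asarray(states_a, dtype=float)
    states_b = np.asarray(states_b, dtype=float)
    T = states_a.shape[0]
    if weights is None:
        weights = np.full(T, 1.0 / T)                        # constant w_s
    per_step = np.linalg.norm(states_a - states_b, axis=1)   # per-timestep distances
    return float(np.dot(weights, per_step))

# toy usage: two 5-step rollouts in a 2-D state space, weighting late steps more
rng = np.random.default_rng(0)
rollout_a = rng.normal(size=(5, 2))
rollout_b = rng.normal(size=(5, 2))
print(primitive_behavior_distance(rollout_a, rollout_b,
                                  weights=np.array([0.1, 0.1, 0.2, 0.3, 0.3])))
```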
“…For example, Proximal Policy Optimization (PPO) [17] penalizes the KL divergence between the old and the new policies, and the resulting objective can be efficiently solved by a first-order method such as gradient descent. Similarly, Behavior Guided Policy Gradient (BGPG) [18] considers the entropy-regularized Wasserstein distance between the old and the new policies, and penalizes this distance to prevent large policy updates.…”
Section: Introduction (mentioning)
confidence: 99%
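A minimal sketch of the penalized surrogate objective described in the excerpt above, assuming access to per-sample advantages and the old/new log-probabilities of the sampled actions. The KL term is a crude sample estimate; swapping it for a Wasserstein term between behavioral embeddings would give a BGPG-style penalty. Function and variable names are illustrative, not taken from either paper.

```python
import numpy as np

def penalized_surrogate(new_logp, old_logp, advantages, beta=1.0):
    """PPO-style penalized objective: importance-weighted advantage minus
    beta times a sample-based estimate of KL(old || new). Illustrative
    sketch only."""
    ratio = np.exp(new_logp - old_logp)          # pi_new / pi_old on sampled actions
    surrogate = np.mean(ratio * advantages)      # policy-gradient surrogate term
    kl_estimate = np.mean(old_logp - new_logp)   # samples come from the old policy
    return surrogate - beta * kl_estimate

# toy usage with random log-probabilities and advantages
rng = np.random.default_rng(0)
old_logp = rng.normal(-1.0, 0.1, size=128)
new_logp = old_logp + rng.normal(0.0, 0.05, size=128)
adv = rng.normal(size=128)
print(penalized_surrogate(new_logp, old_logp, adv, beta=0.5))
```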
“…While those methods achieve impressive performance, and the choice of the KL is well-motivated, one can still ask whether it is possible to include information about the behavior of policies when measuring similarity, and whether this could lead to more efficient algorithms. Pacchiano et al. (2019) provide a first insight into this question, representing policies using behavioral distributions that incorporate information about the outcome of the policies in the environment. The Wasserstein Distance (WD) (Villani, 2016) between those behavioral distributions is then used as a similarity measure between their corresponding policies.…”
Section: Introduction (mentioning)
confidence: 99%
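The sketch below shows one standard way to approximate an entropy-regularized Wasserstein distance between two empirical behavioral distributions using Sinkhorn iterations. It is a generic implementation under assumed uniform sample weights and a squared-Euclidean cost, not the specific estimator used by Pacchiano et al. (2019).

```python
import numpy as np

def sinkhorn_distance(x, y, epsilon=0.1, n_iters=200):
    """Entropy-regularized Wasserstein distance between two empirical
    distributions given by samples x (n, d) and y (m, d) with uniform
    weights. Generic Sinkhorn sketch, not the paper's exact estimator."""
    n, m = len(x), len(y)
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # pairwise costs
    K = np.exp(-cost / epsilon)                                   # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                                      # Sinkhorn updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    transport = np.outer(u, v) * K                                # approximate plan
    return float(np.sum(transport * cost))

# toy usage: behavioral embeddings sampled from an "old" and a "new" policy
rng = np.random.default_rng(1)
old_embeddings = rng.normal(0.0, 1.0, size=(64, 4))
new_embeddings = rng.normal(0.2, 1.0, size=(64, 4))
print(sinkhorn_distance(old_embeddings, new_embeddings))
```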
“…Behavior-Guided Policy Optimization. Motivated by the idea that policies can differ substantially as measured by their KL divergence yet still behave similarly in the environment, Pacchiano et al. (2019) recently proposed using a notion of behavioral proximity between policies for policy optimization. Exploiting similarity in behavior during optimization makes it possible to take larger steps in directions where policies behave similarly despite having a large KL divergence.…”
Section: Introduction (mentioning)
confidence: 99%
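To make this motivating observation concrete, the toy example below (entirely hypothetical, not from the paper) builds two tabular policies that disagree only on a state the dynamics never reach: their average per-state KL divergence is large, yet rollouts visit exactly the same states. This is the situation where a behavioral metric would permit a larger update step than a KL-based trust region.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (no zero entries)."""
    return float(np.sum(p * np.log(p / q)))

def rollout(policy, steps=12, seed=0):
    """Sample a trajectory in a toy 3-state chain where state 2 is unreachable:
    action 1 moves state 0 to state 1, everything else stays put."""
    rng = np.random.default_rng(seed)
    s, visited = 0, []
    for _ in range(steps):
        visited.append(s)
        a = rng.choice(2, p=policy[s])
        s = 1 if (s == 0 and a == 1) else s
    return visited

# two tabular policies that agree on the reachable states 0 and 1,
# but disagree strongly on the unreachable state 2
pi_old = np.array([[0.1, 0.9], [0.9, 0.1], [0.99, 0.01]])
pi_new = np.array([[0.1, 0.9], [0.9, 0.1], [0.01, 0.99]])

print("mean per-state KL :", np.mean([kl(pi_old[s], pi_new[s]) for s in range(3)]))
print("old policy rollout:", rollout(pi_old, seed=0))   # identical visited states
print("new policy rollout:", rollout(pi_new, seed=0))
```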