Trust Region Policy Optimization
2015 · Preprint
DOI: 10.48550/arxiv.1502.05477

Cited by 88 publications (107 citation statements) · References 0 publications

“…However, the improvement in convergence speed is achieved by sacrificing the stability of convergence, and convergence is hard to reach in early-stage training. The asynchronous advantage actor-critic (A3C) [33][45], advantage actor-critic (A2C) [29][36], trust region policy optimization (TRPO) [69], and proximal policy optimization (PPO) [70] algorithms were then introduced to cope with this shortcoming. A multi-thread technique [45] is used in A3C and A2C to accelerate convergence, while TRPO and PPO improve the actor-critic policy update: TRPO introduces a trust region constraint, and PPO introduces a "surrogate" objective and an adaptive penalty, improving both the speed and the stability of convergence.…”
Section: Classification Of Planning Algorithms (mentioning)
confidence: 99%
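
The statement above contrasts TRPO's hard trust-region constraint with PPO's surrogate objective and adaptive penalty. As a hedged sketch of what those terms refer to (notation follows standard presentations of TRPO and PPO, not this page; $\hat{A}_t$ is an advantage estimate, $r_t(\theta)$ the importance-sampling ratio, and $\delta$, $\epsilon$, $\beta$ are hyperparameters):

\[
\max_\theta \; \mathbb{E}_t\!\left[ r_t(\theta)\,\hat{A}_t \right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta,
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\tag{TRPO}
\]

\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]
\tag{PPO, clipped surrogate}
\]

\[
L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\!\left[ r_t(\theta)\,\hat{A}_t \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right]
\tag{PPO, adaptive KL penalty}
\]
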
“…PPO [70][85] is an optimized version of TRPO [69]. Hence, we first introduce TRPO and then PPO.…”
Section: TRPO and PPO (mentioning)
confidence: 99%
“…Specifically, as our RL approach we decided to use a Proximal Policy Optimization (PPO) algorithm [45], which is known to be a modern, general, easy-to-implement and sample-efficient variant of policy gradient techniques. This algorithm is closely related to Trust Region Policy Optimization (TRPO) [46] techniques, in the sense that both rely on updating the current policy under constraints that limit sudden jumps in the policy (see figure 2 and the appendix for the general layout of the algorithm). Even though this is a time-dependent control problem, using recurrent neural networks, i.e.…”
Section: Paragraph (mentioning)
confidence: 99%
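
Both of the statements above describe the same mechanism: the policy update is constrained or penalized so that the new policy cannot move far from the old one in a single step. Below is a minimal, illustrative sketch of the clipped-ratio loss in PyTorch; the function and argument names (ppo_clip_loss, logp_new, logp_old, advantages) are assumptions for illustration, not taken from the cited works.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Importance-sampling ratio pi_new(a|s) / pi_old(a|s), from log-probabilities.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to make
    # large policy jumps, playing the role of TRPO's trust-region constraint.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower-bound) surrogate; negated because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with random data (batch size 64 is an arbitrary choice).
logp_old = torch.randn(64)
logp_new = logp_old + 0.05 * torch.randn(64)
advantages = torch.randn(64)
loss = ppo_clip_loss(logp_new, logp_old, advantages)
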
“…. Inspired by Schulman et al. (2015), in which the difference in the time-0 value function between two policies is shown to be equal to an expected advantage, together with importance sampling and a KL divergence constraint reformulation, the first component in the surrogate performance measure of PPO is given by:…”
Section: REINFORCE: Monte Carlo Policy Gradient (mentioning)
confidence: 99%
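
The equation the quote leads into is truncated on this page, but the two results it names from Schulman et al. (2015) are standard, so a hedged reconstruction (notation assumed: $\eta$ is the expected return, $\rho_\pi$ the discounted state-visitation distribution, $A_\pi$ the advantage) is:

\[
\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t) \right]
\]

\[
L_{\pi}(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \pi}\!\left[ \frac{\tilde{\pi}(a \mid s)}{\pi(a \mid s)}\, A_{\pi}(s, a) \right],
\qquad
\mathbb{E}_{s \sim \rho_{\pi}}\!\left[ D_{\mathrm{KL}}\!\left( \pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s) \right) \right] \le \delta
\]

The importance-sampled ratio times the advantage in $L_{\pi}(\tilde{\pi})$ is the "first component" that the quoted passage says PPO reuses in its surrogate performance measure.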