Trust Region Policy Optimization
2015 · Preprint
DOI: 10.48550/arxiv.1502.05477

Cited by 88 publications (107 citation statements) · References 0 publications

“…However, the improvement in convergence speed is achieved by sacrificing the stability of convergence, and convergence is hard to reach in early-stage training. The asynchronous advantage actor-critic (A3C) [33][45], advantage actor-critic (A2C) [29][36], trust region policy optimization (TRPO) [69], and proximal policy optimization (PPO) [70] algorithms were then introduced to cope with this shortcoming. A multi-thread technique [45] is used in A3C and A2C to accelerate convergence, while TRPO and PPO improve the actor-critic policy update: TRPO introduces a trust region constraint, and PPO introduces a "surrogate" objective and an adaptive penalty, improving both the speed and the stability of convergence.…”
Section: Classification Of Planning Algorithms (mentioning)
confidence: 99%
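
The statement above contrasts TRPO's hard trust-region constraint with PPO's surrogate objective and adaptive penalty. As a hedged sketch of what those terms refer to (notation follows standard presentations of TRPO and PPO, not this page; $\hat{A}_t$ is an advantage estimate, $r_t(\theta)$ the importance-sampling ratio, and $\delta$, $\epsilon$, $\beta$ are hyperparameters):

\[
\max_\theta \; \mathbb{E}_t\!\left[ r_t(\theta)\,\hat{A}_t \right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta,
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\tag{TRPO}
\]

\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]
\tag{PPO, clipped surrogate}
\]

\[
L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\!\left[ r_t(\theta)\,\hat{A}_t \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right]
\tag{PPO, adaptive KL penalty}
\]
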
“…PPO [70][85] is an optimized version of TRPO [69]. Hence, we first introduce TRPO and then PPO.…”
Section: TRPO and PPO (mentioning)
confidence: 99%
“…Specifically, as our RL approach we decided to use a Proximal Policy Optimization (PPO) algorithm [45], which is known to be a modern, general, easy-to-implement and sample-efficient variant of policy gradient techniques. This algorithm is closely related to Trust Region Policy Optimization (TRPO) [46] techniques, in the sense that both rely on updating the current policy under constraints that limit sudden jumps in the policy (see figure 2 and the appendix for the general layout of the algorithm). Even though this is a time-dependent control problem, using recurrent neural networks, i.e.…”
Section: Paragraph (mentioning)
confidence: 99%
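
Both of the statements above describe the same mechanism: the policy update is constrained or penalized so that the new policy cannot move far from the old one in a single step. Below is a minimal, illustrative sketch of the clipped-ratio loss in PyTorch; the function and argument names (ppo_clip_loss, logp_new, logp_old, advantages) are assumptions for illustration, not taken from the cited works.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Importance-sampling ratio pi_new(a|s) / pi_old(a|s), from log-probabilities.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to make
    # large policy jumps, playing the role of TRPO's trust-region constraint.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower-bound) surrogate; negated because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with random data (batch size 64 is an arbitrary choice).
logp_old = torch.randn(64)
logp_new = logp_old + 0.05 * torch.randn(64)
advantages = torch.randn(64)
loss = ppo_clip_loss(logp_new, logp_old, advantages)
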
“…. Inspired by Schulman et al. (2015), in which the difference in the time-0 value function between two policies is shown to be equal to an expected advantage, together with importance sampling and a KL divergence constraint reformulation, the first component in the surrogate performance measure of PPO is given by:…”
Section: REINFORCE: Monte Carlo Policy Gradient (mentioning)
confidence: 99%
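
The equation the quote leads into is truncated on this page, but the two results it names from Schulman et al. (2015) are standard, so a hedged reconstruction (notation assumed: $\eta$ is the expected return, $\rho_\pi$ the discounted state-visitation distribution, $A_\pi$ the advantage) is:

\[
\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t) \right]
\]

\[
L_{\pi}(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \pi}\!\left[ \frac{\tilde{\pi}(a \mid s)}{\pi(a \mid s)}\, A_{\pi}(s, a) \right],
\qquad
\mathbb{E}_{s \sim \rho_{\pi}}\!\left[ D_{\mathrm{KL}}\!\left( \pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s) \right) \right] \le \delta
\]

The importance-sampled ratio times the advantage in $L_{\pi}(\tilde{\pi})$ is the "first component" that the quoted passage says PPO reuses in its surrogate performance measure.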