2015
DOI: 10.48550/arxiv.1506.02438
Preprint

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Abstract: Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of polic…
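The estimator the abstract alludes to is GAE(γ, λ): an exponentially weighted sum of temporal-difference residuals δ_t = r_t + γV(s_{t+1}) − V(s_t). Below is a minimal NumPy sketch of that computation; the function name and array layout are illustrative choices, not taken from the paper's reference implementation.

```python
import numpy as np

def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for one trajectory.

    rewards: array of shape [T]      -- r_0 ... r_{T-1}
    values:  array of shape [T + 1]  -- V(s_0) ... V(s_T), bootstrap value last
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    gae = 0.0
    # A_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}, accumulated backwards
    for t in reversed(range(T)):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting lam=1 recovers the Monte-Carlo advantage (lower bias, higher variance), while lam=0 reduces to a single TD residual (lower variance, more bias from the value function).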

Cited by 534 publications (697 citation statements)
References 12 publications
“…In the analysis, we introduce a slightly modified reward function instead of using the commonly applied function described in [24], because the existing reward function [24] is likely to result in a conservative policy in which the robot does not walk but remains in place. The existing reward function [24] can be defined by…”
Section: Reward Function (mentioning)
confidence: 99%
“…There are multiple choices of the advantage function [23], and we use the baseline version of the Monte-Carlo returns to reduce the variance: Evaluate MBPO and construct the reward as the average return: R_i = Avg(η);…”
Section: Hyper-Controller Learning (mentioning)
confidence: 99%
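The "baseline version of the Monte-Carlo returns" quoted above amounts to subtracting a learned state-value baseline from the empirical discounted return. A hedged sketch under that reading follows; the function and argument names are illustrative, not from the cited work.

```python
import numpy as np

def monte_carlo_advantages(rewards, values, gamma=0.99):
    """Advantage = discounted Monte-Carlo return minus a value baseline.

    rewards: array of shape [T]  -- rewards along one rollout
    values:  array of shape [T]  -- baseline predictions V(s_t)
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    # G_t = r_t + gamma * G_{t+1}, accumulated backwards from the episode end
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values
```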
“…This formulation allows us to use advanced, actor-critic-type approaches [25] to improve the sample efficiency. In our implementation, we use generalized advantage estimation (GAE) [26] A(s, a_1, ..., a_n) = Q(s, a_1, ..., a_n) − V(s) in place of the Q function to calculate the gradient and Hessian terms. After sampling a batch of states, actions, and rewards from the replay buffer, we construct two pseudo-objectives, one for the first derivative terms and one for the mixed Hessian terms required for the PCGD update.…”
Section: Multiagent Reinforcement Learning (MARL) (mentioning)
confidence: 99%
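For the actor-critic use described in this statement, the advantage A(s, a_1, ..., a_n) = Q(s, a_1, ..., a_n) − V(s) stands in for the Q function when forming the gradient objective. A minimal PyTorch-style sketch of such a pseudo-objective for the first-derivative terms is given below; the PCGD-specific mixed-Hessian terms are omitted and all names are chosen for illustration.

```python
import torch

def policy_gradient_pseudo_objective(log_probs, advantages):
    """Surrogate whose gradient matches the policy-gradient estimate.

    log_probs:  [T] log pi(a_t | s_t) for the sampled batch (requires grad)
    advantages: [T] advantage estimates, e.g. from GAE; treated as constants
    """
    # Detach advantages so gradients flow only through the policy log-probs.
    return (log_probs * advantages.detach()).mean()

# Usage sketch: maximize the objective by minimizing its negative.
# loss = -policy_gradient_pseudo_objective(log_probs, advantages)
# loss.backward(); optimizer.step()
```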