2017
DOI: 10.48550/arxiv.1707.06347
Preprint

Proximal Policy Optimization Algorithms

John Schulman,
Filip Wolski,
Prafulla Dhariwal
et al.

Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
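The "surrogate" objective referred to in the abstract is, in the paper's main (clipped) variant, the following; here $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clipping parameter:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
$$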

Cited by 3,564 publications (5,581 citation statements)
References 12 publications
“…• Proximal Policy Optimization (PPO): (Schulman et al., 2017) A model-free, on-policy, policy-gradient RL method. It uses a clipped surrogate objective to limit the size of the policy change at each step, thereby improving stability.…”
Section: Methods
mentioning
confidence: 99%
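A minimal sketch of that clipped surrogate loss in Python (PyTorch); the argument names and the ε = 0.2 default are illustrative choices for this sketch, not something mandated by the paper:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized with SGD.

    log_probs_new: log pi_theta(a_t | s_t) under the policy being optimized
    log_probs_old: log pi_theta_old(a_t | s_t) under the data-collecting policy
    advantages:    advantage estimates A_hat_t (e.g. from GAE)
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower) bound on the unclipped objective, averaged over the batch
    return -torch.min(unclipped, clipped).mean()
```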
“…Q-Learning [60], Deep Q-Network (DQN) [36], and its variants such as Double-DQN [21] are normally designed for discrete action-space tasks. To enable continuous action spaces, policy-based algorithms such as Proximal Policy Optimization (PPO) [45], Trust Region Policy Optimization (TRPO) [44], and Soft Actor-Critic [19] have been proposed. These algorithms represent the stochastic policy by a Gaussian distribution, and the agent samples from the distribution to obtain a specific action.…”
Section: Related Work
mentioning
confidence: 99%
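A small sketch of what representing the stochastic policy by a Gaussian distribution typically looks like; the network shape and names here are illustrative assumptions, not taken from any of the cited papers:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy head for continuous action spaces."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        # State-independent log standard deviation, one entry per action dimension
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.body(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()                    # sample a concrete action to execute
        log_prob = dist.log_prob(action).sum(-1)  # needed for the policy-gradient update
        return action, log_prob
```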
“…To enable more general allocation decision-making, a continuous action space is required [45,19]. For continuous-action-space sequential allocation problems, the RL algorithms need to satisfy the simplex constraints outlined above.…”
mentioning
confidence: 99%
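One common way to satisfy such simplex constraints (allocation weights are non-negative and sum to one) is to pass the policy's unconstrained output through a softmax; a hedged illustration of that idea, with hypothetical names:

```python
import torch

def project_to_simplex(raw_allocation_logits):
    """Map unconstrained policy outputs to an allocation on the probability simplex.

    Softmax guarantees every weight is non-negative and the weights sum to one,
    which is exactly the simplex constraint described above.
    """
    return torch.softmax(raw_allocation_logits, dim=-1)

# Example: three resources, unconstrained network output
weights = project_to_simplex(torch.tensor([1.2, -0.3, 0.5]))
assert torch.isclose(weights.sum(), torch.tensor(1.0))
```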
“…The classic Policy Iteration (PI) Howard (1960) and Value Iteration (VI) algorithms are the basis for most state-of-the-art reinforcement learning (RL) algorithms. As both PI and VI are based on a one-step greedy approach for policy improvement, so are the most commonly used policy-gradient Schulman et al (2017); Haarnoja et al (2018) and Q-learning Mnih et al (2013); Hessel et al (2018) based approaches. In each iteration, they perform an improvement of their current policy by looking one step forward and acting greedily.…”
Section: Introduction
mentioning
confidence: 99%
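As a reminder of what "looking one step forward and acting greedily" means, here is a minimal tabular Value Iteration sketch; the transition tensor `P` and reward matrix `R` are assumed inputs for illustration, not taken from the citing paper:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    """Tabular Value Iteration with one-step greedy improvement.

    P: transition probabilities, shape (S, A, S)
    R: expected immediate rewards, shape (S, A)
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # One-step lookahead: Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
        Q = R + gamma * (P @ V)        # shape (S, A)
        V = Q.max(axis=1)              # greedy improvement over the one-step lookahead
    policy = Q.argmax(axis=1)          # greedy policy w.r.t. the final value estimate
    return V, policy
```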