2017
DOI: 10.48550/arxiv.1707.06347
Preprint

Proximal Policy Optimization Algorithms

John Schulman,
Filip Wolski,
Prafulla Dhariwal
et al.

Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
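The "surrogate" objective referred to in the abstract is, in the paper's main (clipped) variant, the following; here $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clipping parameter:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
$$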

Cited by 3,564 publications (5,581 citation statements)
References 12 publications
“…• Proximal Policy Optimization (PPO): (Schulman et al., 2017) A model-free, on-policy, policy-gradient RL method. It uses a clipped surrogate objective to limit the size of the policy change at each step, thereby improving stability.…”
Section: Methods
mentioning
confidence: 99%
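A minimal sketch of that clipped surrogate loss in Python (PyTorch); the argument names and the ε = 0.2 default are illustrative choices for this sketch, not something mandated by the paper:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized with SGD.

    log_probs_new: log pi_theta(a_t | s_t) under the policy being optimized
    log_probs_old: log pi_theta_old(a_t | s_t) under the data-collecting policy
    advantages:    advantage estimates A_hat_t (e.g. from GAE)
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower) bound on the unclipped objective, averaged over the batch
    return -torch.min(unclipped, clipped).mean()
```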
“…Q-Learning [60], Deep Q-Network (DQN) [36], and its variants such as Double-DQN [21] are normally designed for discrete action-space tasks. To enable continuous action spaces, policy-based algorithms such as Proximal Policy Optimization (PPO) [45], Trust Region Policy Optimization (TRPO) [44], and Soft Actor-Critic [19] have been proposed. These algorithms represent the stochastic policy by a Gaussian distribution, and the agent samples from the distribution to obtain a specific action.…”
Section: Related Work
mentioning
confidence: 99%
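A small sketch of what representing the stochastic policy by a Gaussian distribution typically looks like; the network shape and names here are illustrative assumptions, not taken from any of the cited papers:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy head for continuous action spaces."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        # State-independent log standard deviation, one entry per action dimension
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.body(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()                    # sample a concrete action to execute
        log_prob = dist.log_prob(action).sum(-1)  # needed for the policy-gradient update
        return action, log_prob
```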
“…To enable more general allocation decision-making, a continuous action space is required [45,19]. For continuous-action-space sequential allocation problems, the RL algorithms need to satisfy the simplex constraints outlined above.…”
mentioning
confidence: 99%
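One common way to satisfy such simplex constraints (allocation weights are non-negative and sum to one) is to pass the policy's unconstrained output through a softmax; a hedged illustration of that idea, with hypothetical names:

```python
import torch

def project_to_simplex(raw_allocation_logits):
    """Map unconstrained policy outputs to an allocation on the probability simplex.

    Softmax guarantees every weight is non-negative and the weights sum to one,
    which is exactly the simplex constraint described above.
    """
    return torch.softmax(raw_allocation_logits, dim=-1)

# Example: three resources, unconstrained network output
weights = project_to_simplex(torch.tensor([1.2, -0.3, 0.5]))
assert torch.isclose(weights.sum(), torch.tensor(1.0))
```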
“…The classic Policy Iteration (PI) Howard (1960) and Value Iteration (VI) algorithms are the basis for most state-of-the-art reinforcement learning (RL) algorithms. As both PI and VI are based on a one-step greedy approach for policy improvement, so are the most commonly used policy-gradient Schulman et al (2017); Haarnoja et al (2018) and Q-learning Mnih et al (2013); Hessel et al (2018) based approaches. In each iteration, they perform an improvement of their current policy by looking one step forward and acting greedily.…”
Section: Introduction
mentioning
confidence: 99%
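As a reminder of what "looking one step forward and acting greedily" means, here is a minimal tabular Value Iteration sketch; the transition tensor `P` and reward matrix `R` are assumed inputs for illustration, not taken from the citing paper:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    """Tabular Value Iteration with one-step greedy improvement.

    P: transition probabilities, shape (S, A, S)
    R: expected immediate rewards, shape (S, A)
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # One-step lookahead: Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
        Q = R + gamma * (P @ V)        # shape (S, A)
        V = Q.max(axis=1)              # greedy improvement over the one-step lookahead
    policy = Q.argmax(axis=1)          # greedy policy w.r.t. the final value estimate
    return V, policy
```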