Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Preprint, 2016
DOI: 10.48550/arxiv.1611.02247

Abstract: Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is their high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to…
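The abstract sketches the core trade-off and the proposed remedy: keep the unbiased, stable likelihood-ratio gradient, but use an off-policy critic as a control variate so that most of the variance is carried by an analytic term rather than by Monte Carlo samples. The snippet below is a minimal NumPy sketch of one way to build such a control variate (a first-order Taylor expansion of the critic around the policy mean); every name in it (qprop_terms, policy_mean, grad_q_wrt_action, and so on) is an illustrative placeholder, not the authors' implementation or any library's API.

```python
# Minimal sketch of a Q-Prop-style control variate, written against the
# abstract above. All names are hypothetical placeholders, not the paper's code.
import numpy as np

def qprop_terms(states, actions, advantages, policy_mean, grad_q_wrt_action):
    """Split the likelihood-ratio policy gradient into a Monte Carlo residual
    plus an analytic term supplied by an off-policy critic.

    states            -- (N, ds) sampled states
    actions           -- (N, da) sampled actions
    advantages        -- (N,)    Monte Carlo advantage estimates A_hat(s, a)
    policy_mean       -- callable s -> deterministic mean action mu(s)
    grad_q_wrt_action -- callable (s, mu) -> dQ_w/da evaluated at a = mu(s)
    """
    mu = np.stack([policy_mean(s) for s in states])                       # (N, da)
    dq = np.stack([grad_q_wrt_action(s, m) for s, m in zip(states, mu)])  # (N, da)

    # Control variate: first-order Taylor expansion of the critic around mu(s),
    # A_bar(s, a) = dQ/da|_{a=mu(s)} . (a - mu(s)).
    a_bar = np.sum(dq * (actions - mu), axis=1)                           # (N,)

    # The residual (A_hat - A_bar) weights the usual REINFORCE-style term and
    # stays unbiased; if the critic is accurate, its variance is much smaller.
    residual = advantages - a_bar

    # The analytic correction is a deterministic-policy-gradient term through
    # the critic, dQ/da|_{a=mu(s)} * dmu/dtheta; dq is its action-space factor.
    return residual, dq
```

In a full estimator, residual would multiply the score function grad_theta log pi(a|s) and dq would be chained with dmu/dtheta, so the combined gradient remains unbiased while the high-variance Monte Carlo part shrinks as the critic improves.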

Cited by 80 publications (103 citation statements)
References 16 publications
“…For the MuJoCo-gym environments, we only consider results that were reported with the v1 version of the respective environment up to 2019, as the earliest publication of the latest result we found for v1 (Abdolmaleki et al, 2018a) came out in December 2018, but include results that use v2 or an ambiguous version from 2019 and 2020. Overall, we considered TRPO (Schulman et al, 2015), DDPG (Lillicrap et al, 2015), Q-Prop (Gu et al, 2016), Soft Q-learning (Haarnoja et al, 2017), ACKTR (Wu et al, 2017), PPO (Schulman et al, 2017), Clipped Action Policy Gradients (Fujita & Maeda, 2018), TD3 (Fujimoto et al, 2018), STEVE (Buckman et al, 2018), SAC (Haarnoja et al, 2018) and Relative Entropy Regularized Policy Iteration (Abdolmaleki et al, 2018a) for gym-v1.…”
Section: B.2 MuJoCo-Gym (mentioning)
confidence: 99%
“…The amount of time needed for an agent to learn high-reward-yielding behavior cannot be predetermined and depends on a host of factors, including the complexity of the environment, the complexity of the agent, and more. Yet, overall, it has been well established that deep RL agents tend to be very sample-inefficient (Gu et al, 2017), so it is recommended to provide a generous training budget for these agents.…”
Section: Deep RL Agents (mentioning)
confidence: 99%
“…Medina & Yang (2016) extended this approach to the problem of linear bandits under heavy-tailed noise. There is also a long line of work in deep RL which focuses on reducing the variance of stochastic policy gradients (Gu et al, 2016; Wu et al, 2018; Cheng et al, 2020). On the flip side, Chung et al (2020) highlighted the beneficial impacts of stochasticity of policy gradients on the optimization process.…”
Section: Related Work (mentioning)
confidence: 99%