Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Preprint, 2016
DOI: 10.48550/arxiv.1611.02247

Abstract: Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is their high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to…
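The abstract sketches the core trade-off and the proposed remedy: keep the unbiased, stable likelihood-ratio gradient, but use an off-policy critic as a control variate so that most of the variance is carried by an analytic term rather than by Monte Carlo samples. The snippet below is a minimal NumPy sketch of one way to build such a control variate (a first-order Taylor expansion of the critic around the policy mean); every name in it (qprop_terms, policy_mean, grad_q_wrt_action, and so on) is an illustrative placeholder, not the authors' implementation or any library's API.

```python
# Minimal sketch of a Q-Prop-style control variate, written against the
# abstract above. All names are hypothetical placeholders, not the paper's code.
import numpy as np

def qprop_terms(states, actions, advantages, policy_mean, grad_q_wrt_action):
    """Split the likelihood-ratio policy gradient into a Monte Carlo residual
    plus an analytic term supplied by an off-policy critic.

    states            -- (N, ds) sampled states
    actions           -- (N, da) sampled actions
    advantages        -- (N,)    Monte Carlo advantage estimates A_hat(s, a)
    policy_mean       -- callable s -> deterministic mean action mu(s)
    grad_q_wrt_action -- callable (s, mu) -> dQ_w/da evaluated at a = mu(s)
    """
    mu = np.stack([policy_mean(s) for s in states])                       # (N, da)
    dq = np.stack([grad_q_wrt_action(s, m) for s, m in zip(states, mu)])  # (N, da)

    # Control variate: first-order Taylor expansion of the critic around mu(s),
    # A_bar(s, a) = dQ/da|_{a=mu(s)} . (a - mu(s)).
    a_bar = np.sum(dq * (actions - mu), axis=1)                           # (N,)

    # The residual (A_hat - A_bar) weights the usual REINFORCE-style term and
    # stays unbiased; if the critic is accurate, its variance is much smaller.
    residual = advantages - a_bar

    # The analytic correction is a deterministic-policy-gradient term through
    # the critic, dQ/da|_{a=mu(s)} * dmu/dtheta; dq is its action-space factor.
    return residual, dq
```

In a full estimator, residual would multiply the score function grad_theta log pi(a|s) and dq would be chained with dmu/dtheta, so the combined gradient remains unbiased while the high-variance Monte Carlo part shrinks as the critic improves.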

Cited by 80 publications (103 citation statements)
References 16 publications
“…For the MuJoCo-gym environments, we only consider results that were reported with the v1 version of the respective environment up to 2019, as the earliest publication of the latest result we found for v1 (Abdolmaleki et al, 2018a) came out in December 2018, but include results that use v2 or an ambiguous version from 2019 and 2020. Overall, we considered TRPO (Schulman et al, 2015), DDPG (Lillicrap et al, 2015), Q-Prop (Gu et al, 2016), Soft Q-learning (Haarnoja et al, 2017), ACKTR (Wu et al, 2017), PPO (Schulman et al, 2017), Clipped Action Policy Gradients (Fujita & Maeda, 2018), TD3 (Fujimoto et al, 2018), STEVE (Buckman et al, 2018), SAC (Haarnoja et al, 2018) and Relative Entropy Regularized Policy Iteration (Abdolmaleki et al, 2018a) for gym-v1.…”
Section: B.2 MuJoCo-Gym (mentioning)
confidence: 99%
“…The amount of time needed for an agent to learn high-reward-yielding behavior cannot be predetermined and depends on a host of factors, including the complexity of the environment, the complexity of the agent, and more. Yet, overall, it has been well established that deep RL agents tend to be very sample-inefficient (Gu et al, 2017), so it is recommended to provide a generous training budget for these agents.…”
Section: Deep RL Agents (mentioning)
confidence: 99%
“…Medina & Yang (2016) extended this approach to the problem of linear bandits under heavy-tailed noise. There is also a long line of work in deep RL which focuses on reducing the variance of stochastic policy gradients (Gu et al, 2016; Wu et al, 2018; Cheng et al, 2020). On the flip side, Chung et al (2020) highlighted the beneficial impacts of stochasticity of policy gradients on the optimization process.…”
Section: Related Work (mentioning)
confidence: 99%