2019
DOI: 10.48550/arxiv.1905.01756
Preprint

P3O: Policy-on Policy-off Policy Optimization

Cited by 3 publications (6 citation statements) | References 0 publications

“…The pair of equations (3) and (4) form a coupled pair of optimization problems with variables (θ, ϕ) that can be solved by, say, taking gradient steps on each objective alternately while keeping the other parameter fixed. Although off-policy methods have shown promising performance in various tasks and are usually more sample-efficient than on-policy methods [3][4][5][6], they are often very sensitive to hyper-parameters and exploration methods, among other things [7]. This has led to a surge of interest in improving these methods.…”
Section: Problem Setup
Mentioning confidence: 99%
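
Editor's note: a minimal sketch of the alternating scheme described in the quote above, written for a generic PyTorch actor-critic pair. The losses below are generic stand-ins, not the citing paper's equations (3)-(4), and the names (actor, critic, batch layout) are illustrative assumptions.

```python
# Sketch of alternating gradient steps on a coupled pair of objectives:
# a critic with parameters phi and an actor with parameters theta.
# The concrete losses are generic stand-ins, not the cited equations (3)-(4).
import torch

def alternating_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch

    # Step 1: update the critic (phi) while the actor (theta) is held fixed.
    with torch.no_grad():
        next_actions = actor(next_states)
        target_q = rewards + gamma * (1 - dones) * critic(next_states, next_actions)
    critic_loss = torch.mean((critic(states, actions) - target_q) ** 2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 2: update the actor (theta) while the critic (phi) is held fixed.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```
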
“…These transitions are the most important to update the controller in (4), while the others in the dataset D may lead to deterioration of the controller. We follow [3,13] to estimate the propensity between the action distribution of the current policy and the action distribution of the past policies. This propensity is used to filter out transitions in the data that may lead to deterioration of the controller during the policy update.…”
Section: Contributions
Mentioning confidence: 99%
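
Editor's note: a sketch of the propensity-based filtering the quote describes, assuming the propensity is the importance ratio π_current(a|s) / π_behavior(a|s) and that transitions with ratios far from 1 are discarded. The threshold and function names are hypothetical, not taken from the cited papers.

```python
# Keep only transitions whose importance ratio (propensity) between the
# current policy and the data-generating past policies is close to 1,
# i.e. transitions unlikely to destabilize the policy update.
import torch

def filter_transitions(current_logp, behavior_logp, states, actions, max_ratio=2.0):
    """current_logp, behavior_logp: log pi(a|s) for each transition under the
    current policy and under the (past) behavior policies, respectively."""
    ratio = torch.exp(current_logp - behavior_logp)          # propensity score
    keep = (ratio < max_ratio) & (ratio > 1.0 / max_ratio)   # two-sided filter
    return states[keep], actions[keep], ratio[keep]
```
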
“…Hence Deep Q-learning is still a fundamental part of methods such as DDPG [10], A3C [16], TD3 [9], and SAC [11], all of which are related to our method. Alternative approaches that follow the AC paradigm but employ sample estimates of the Q-function can be found in [17], TRPO [3], PPO [4], and P3O [18].…”
Section: Related Work
Mentioning confidence: 99%
“…Remark 3 (Picking the coefficient λ). Following Fakoor et al. (2019), we pick λ = 1 − ESS for both the steps (18)-(19). This relaxes the quadratic penalty if the new task is similar to the meta-training tasks (ESS is large) and vice-versa.…”
Section: Adaptation To a New Task
Mentioning confidence: 99%
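
Editor's note: a sketch of computing λ = 1 − ESS from importance weights, as the quote describes. The normalized effective-sample-size estimate ESS = ||w||₁² / (n·||w||₂²), which maps ESS into [0, 1], is the standard form; treat the exact normalization and the function name as assumptions rather than the citing paper's definition.

```python
# Compute the proximal coefficient lambda = 1 - ESS, where ESS is the
# normalized effective sample size of the importance weights between the
# current policy and the data-generating (behavior) policies.
import torch

def proximal_coefficient(current_logp, behavior_logp):
    w = torch.exp(current_logp - behavior_logp)        # importance weights
    n = w.numel()
    ess = w.sum() ** 2 / (n * (w ** 2).sum() + 1e-8)   # normalized ESS in [0, 1]
    return (1.0 - ess).item()                          # lambda = 1 - ESS
```

A large ESS (weights close to uniform, new task close to the meta-training data) gives a small λ and a relaxed quadratic penalty; a small ESS tightens it.
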
“…The off-policy updates in MQL are essential to exploiting this data. The coefficient of the proximal term in the adaptation-phase objective (18)-(19), which uses the effective sample size (ESS), is inspired by the recent work of Fakoor et al. (2019).…”
Section: Related Work
Mentioning confidence: 99%