2019
DOI: 10.48550/arxiv.1905.01756
Preprint

P3O: Policy-on Policy-off Policy Optimization

Cited by 3 publications (6 citation statements) | References 0 publications

“…The pair of equations (3) and (4) form a coupled pair of optimization problems with variables (θ, ϕ) that can be solved by, say, taking gradient steps on each objective alternately while keeping the other parameter fixed. Although off-policy methods have shown promising performance in various tasks and are usually more sample-efficient than on-policy methods [3][4][5][6], they are often very sensitive to hyper-parameters and exploration methods, among other things [7]. This has led to a surge of interest in improving these methods.…”
Section: Problem Setup
Mentioning confidence: 99%
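
Editor's note: a minimal sketch of the alternating scheme described in the quote above, written for a generic PyTorch actor-critic pair. The losses below are generic stand-ins, not the citing paper's equations (3)-(4), and the names (actor, critic, batch layout) are illustrative assumptions.

```python
# Sketch of alternating gradient steps on a coupled pair of objectives:
# a critic with parameters phi and an actor with parameters theta.
# The concrete losses are generic stand-ins, not the cited equations (3)-(4).
import torch

def alternating_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch

    # Step 1: update the critic (phi) while the actor (theta) is held fixed.
    with torch.no_grad():
        next_actions = actor(next_states)
        target_q = rewards + gamma * (1 - dones) * critic(next_states, next_actions)
    critic_loss = torch.mean((critic(states, actions) - target_q) ** 2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 2: update the actor (theta) while the critic (phi) is held fixed.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```
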
“…These transitions are the most important to update the controller in (4), while the others in the dataset D may lead to deterioration of the controller. We follow [3,13] to estimate the propensity between the action distribution of the current policy and the action distribution of the past policies. This propensity is used to filter out transitions in the data that may lead to deterioration of the controller during the policy update.…”
Section: Contributions
Mentioning confidence: 99%
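
Editor's note: a sketch of the propensity-based filtering the quote describes, assuming the propensity is the importance ratio π_current(a|s) / π_behavior(a|s) and that transitions with ratios far from 1 are discarded. The threshold and function names are hypothetical, not taken from the cited papers.

```python
# Keep only transitions whose importance ratio (propensity) between the
# current policy and the data-generating past policies is close to 1,
# i.e. transitions unlikely to destabilize the policy update.
import torch

def filter_transitions(current_logp, behavior_logp, states, actions, max_ratio=2.0):
    """current_logp, behavior_logp: log pi(a|s) for each transition under the
    current policy and under the (past) behavior policies, respectively."""
    ratio = torch.exp(current_logp - behavior_logp)          # propensity score
    keep = (ratio < max_ratio) & (ratio > 1.0 / max_ratio)   # two-sided filter
    return states[keep], actions[keep], ratio[keep]
```
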
“…Hence Deep Q-learning is still a fundamental part of methods such as DDPG [10], A3C [16], TD3 [9], and SAC [11], all of which are related to our method. Alternative approaches that follow the AC paradigm but employ sample estimates of the Q-function can be found in [17], TRPO [3], PPO [4], and P3O [18].…”
Section: Related Work
Mentioning confidence: 99%
“…Remark 3 (Picking the coefficient λ). Following Fakoor et al. (2019), we pick λ = 1 − ESS for both the steps (18)-(19). This relaxes the quadratic penalty if the new task is similar to the meta-training tasks (ESS is large) and vice-versa.…”
Section: Adaptation To a New Task
Mentioning confidence: 99%
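
Editor's note: a sketch of computing λ = 1 − ESS from importance weights, as the quote describes. The normalized effective-sample-size estimate ESS = ||w||₁² / (n·||w||₂²), which maps ESS into [0, 1], is the standard form; treat the exact normalization and the function name as assumptions rather than the citing paper's definition.

```python
# Compute the proximal coefficient lambda = 1 - ESS, where ESS is the
# normalized effective sample size of the importance weights between the
# current policy and the data-generating (behavior) policies.
import torch

def proximal_coefficient(current_logp, behavior_logp):
    w = torch.exp(current_logp - behavior_logp)        # importance weights
    n = w.numel()
    ess = w.sum() ** 2 / (n * (w ** 2).sum() + 1e-8)   # normalized ESS in [0, 1]
    return (1.0 - ess).item()                          # lambda = 1 - ESS
```

A large ESS (weights close to uniform, new task close to the meta-training data) gives a small λ and a relaxed quadratic penalty; a small ESS tightens it.
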
“…The off-policy updates in MQL are essential to exploiting this data. The coefficient of the proximal term in the adaptation-phase objective (18)-(19), which uses the effective sample size (ESS), is inspired by the recent work of Fakoor et al. (2019).…”
Section: Related Work
Mentioning confidence: 99%