2022
DOI: 10.48550/arxiv.2203.12679
Preprint

Sample-efficient Iterative Lower Bound Optimization of Deep Reactive Policies for Planning in Continuous MDPs

Abstract: Recent advances in deep learning have enabled optimization of deep reactive policies (DRPs) for continuous MDP planning by encoding a parametric policy as a deep neural network and exploiting automatic differentiation in an end-to-end model-based gradient descent framework. This approach has proven effective for optimizing DRPs in nonlinear continuous MDPs, but it requires a large number of sampled trajectories to learn effectively and can suffer from high variance in solution quality. In this work, we revisit …
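The end-to-end model-based gradient descent framework described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes a known differentiable transition model and reward function, and every name here (policy, step, reward, neg_return, the network sizes, step size, and horizon) is an illustrative placeholder rather than anything taken from the paper.

    # Minimal sketch of end-to-end model-based gradient descent over a DRP,
    # assuming a known differentiable transition model and reward (illustrative only).
    import jax
    import jax.numpy as jnp

    def policy(params, s):
        # Tiny one-hidden-layer DRP mapping state -> action.
        h = jnp.tanh(params["W1"] @ s + params["b1"])
        return params["W2"] @ h + params["b2"]

    def step(s, a):
        # Placeholder nonlinear transition model; a real planner would use the MDP's dynamics.
        return s + 0.1 * a.sum() - 0.01 * s**3

    def reward(s, a):
        # Placeholder quadratic cost on state and action.
        return -(jnp.sum(s**2) + 0.1 * jnp.sum(a**2))

    def neg_return(params, s0, horizon=20):
        # Roll the policy through the model and accumulate reward end to end,
        # so gradients flow through the entire trajectory.
        total, s = 0.0, s0
        for _ in range(horizon):
            a = policy(params, s)
            total = total + reward(s, a)
            s = step(s, a)
        return -total

    key = jax.random.PRNGKey(0)
    dim_s, dim_a, hidden = 3, 2, 16
    params = {
        "W1": 0.1 * jax.random.normal(key, (hidden, dim_s)),
        "b1": jnp.zeros(hidden),
        "W2": 0.1 * jax.random.normal(key, (dim_a, hidden)),
        "b2": jnp.zeros(dim_a),
    }
    grad_fn = jax.jit(jax.grad(neg_return))
    s0 = jnp.ones(dim_s)
    for i in range(200):
        grads = grad_fn(params, s0)  # gradients of the negated return w.r.t. policy parameters
        params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)

The key design point the abstract alludes to is that the policy, transition model, and reward are composed into a single differentiable computation, so standard automatic differentiation yields policy gradients without Monte Carlo policy-gradient estimators.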

Cited by 1 publication (1 citation statement). References: 19 publications.
“…where superscript m denotes the dependence of the corresponding term on the previous policy estimate µ^m (e.g., V^m is the value function of the policy µ_{θ_m}, and d^m is the corresponding state occupancy measure). We can now eliminate the equality constraints (13)–(14) by substituting them directly into the objective to get the final simplified problem:…” [Footnote 1: Supplementary material provided in the arXiv version of this paper (Low, Kumar, and Sanner 2022).]
Section: Maximization Step in MM
Confidence: 99%
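As a toy illustration only (not the cited paper's actual objective or its constraints (13)–(14)): eliminating an equality constraint by substitution turns a problem of the form max over θ and x of f(x, θ) subject to x = g(θ) into the unconstrained problem max over θ of f(g(θ), θ), which is the simplification the citing passage applies to its maximization step.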