2020
DOI: 10.1007/978-3-030-61380-8_8
On the Performance of Planning Through Backpropagation

Cited by 3 publications (5 citation statements)
References 7 publications
“…In the past decade, deep learning (DL) methods have demonstrated remarkable success in a variety of complex applications in computer vision, natural language, and signal processing (Krizhevsky, Sutskever, and Hinton 2017; Hinton et al. 2012; Bengio, Lecun, and Hinton 2021). More recently, a variety of work has sought to leverage DL tools for planning and policy learning in a large variety of deterministic and stochastic decision-making domains (Wu, Say, and Sanner 2017; Wu, Say, and Sanner 2020; Say et al. 2020; Scaroni et al. 2020; Say 2021; Toyer et al. 2020; Garg, Bajpai, and Mausam 2020).…”
Section: Introduction
confidence: 99%
“…In the past decade, deep learning (DL) methods have demonstrated remarkable success in a variety of complex applications in computer vision, natural language, and signal processing (Krizhevsky, Sutskever, and Hinton 2017; Hinton et al. 2012; Bengio, Lecun, and Hinton 2021). More recently, a variety of work has sought to leverage DL tools for planning and policy learning in a large variety of deterministic and stochastic decision-making domains (Wu, Say, and Sanner 2017; Bueno et al. 2019; Wu, Say, and Sanner 2020; Say et al. 2020; Scaroni et al. 2020; Say 2021; Toyer et al. 2020; Garg, Bajpai, and Mausam 2020).…”
Section: Introduction
confidence: 99%
“…However, a recent direction of significant influence on the present work is the use of automatic differentiation in an end-to-end, model-based gradient descent framework that leverages recent advances in non-convex optimization from DL. The majority of work in this direction has focused on deterministic continuous planning models, both known (Wu, Say, and Sanner 2017; Scaroni et al. 2020) and learned (Wu, Say, and Sanner 2020; Say 2021). In this work, however, we are specifically concerned with learning deep reactive policies (DRPs) for fast decision-making in general continuous state-action MDPs (CSA-MDPs).…”
Section: Introduction
confidence: 99%
“…In addition, we proposed a different formulation of planning through backpropagation as trajectory optimization, thus making clear the distinction between learning internal representations in Recurrent Neural Networks (RNNs) (Goodfellow et al., 2016) and optimizing trajectories (either action trajectories, i.e., plans, in the shooting formulation of Definition 4.1.2, or state-action trajectories in the direct transcription of Definition 4.1.3). Preliminary results of our formulation and an analysis of the optimality gap have recently been published (Scaroni et al., 2020).…”
Section: Discussion
confidence: 99%
“…Inspired by the recurrent computation of Recurrent Neural Networks (RNNs), TensorPlan leverages the backpropagation-through-time technique to optimize the model inputs (i.e., the agent's actions) instead of the internal neural representations. In this thesis, we reinterpret TensorPlan and propose to formulate planning through backpropagation as trajectory optimization (Scaroni et al., 2020). We remark that this reinterpretation makes interesting connections to control theory and broadens the understanding of differentiable planning as a family of general gradient-based methods that can be built on top of different optimization formulations, which may lead to novel algorithms in the future.…”
Section: Thesis Proposal and Contributions
confidence: 99%
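The shooting formulation described in these excerpts (optimizing the action sequence itself by backpropagating a trajectory cost through a known dynamics model, rather than optimizing network weights) can be illustrated with a minimal sketch. The 1-D linear dynamics, the quadratic cost, and all names below are illustrative assumptions for this sketch, not taken from the cited papers:

```python
def plan_by_gradient_descent(s0, goal, horizon=5, steps=200, lr=0.1):
    """Shooting-style planning through backpropagation (illustrative sketch).

    Optimizes an action sequence a_0..a_{H-1} by gradient descent on the
    trajectory cost J = sum_t a_t^2 + (s_T - goal)^2, for the assumed
    1-D linear dynamics s_{t+1} = s_t + a_t.
    """
    actions = [0.0] * horizon
    for _ in range(steps):
        # Forward pass: roll the known dynamics out over the horizon.
        states = [s0]
        for a in actions:
            states.append(states[-1] + a)
        # Backward pass: backpropagate the cost through time to the inputs.
        # Here ds_T/da_t = 1 for every t, so dJ/da_t = 2*a_t + 2*(s_T - goal).
        terminal_grad = 2.0 * (states[-1] - goal)
        grads = [2.0 * a + terminal_grad for a in actions]
        # Gradient step on the plan itself (not on any network weights).
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions, s0 + sum(actions)

actions, s_T = plan_by_gradient_descent(s0=0.0, goal=1.0)
```

For this convex toy cost the optimum is analytic (every a_t = 1/(H+1), so s_T = H/(H+1)), which makes the sketch easy to check; the point of the technique in the cited work is that the same forward-rollout/backward-gradient loop applies when the dynamics are a learned or nonlinear differentiable model and the gradients come from automatic differentiation.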