Policy iteration for Hamilton–Jacobi–Bellman equations with control constraints

Kundu, Sudeep; Kunisch, Karl

doi:10.1007/s10589-021-00278-3

Cited by 5 publications

(3 citation statements)

References 37 publications

(50 reference statements)

“…Reducing the size of the constraint box, the difference between the no-gradient regression and choosing the best λ increases, confirming that the gradient cross achieves a better result for the constrained case in presence of information of both the value function and its gradient. The same example has been considered in [32]. We fix σ = 10, β = 8/3, ρ = 2 and (x(0), y(0), z(0)) = (−1, −1, −1).…”

Section: Optimal Control

mentioning

confidence: 99%

Data-driven Tensor Train Gradient Cross Approximation for Hamilton-Jacobi-Bellman Equations

Dolgov, Kalise, Saluzzi

2022

Preprint

A gradient-enhanced functional tensor train cross approximation method for the resolution of the Hamilton-Jacobi-Bellman (HJB) equations associated to optimal feedback control of nonlinear dynamics is presented. The procedure uses samples of both the solution of the HJB equation and its gradient to obtain a tensor train approximation of the value function. The collection of the data for the algorithm is based on two possible techniques: Pontryagin Maximum Principle and State-Dependent Riccati Equations. Several numerical tests are presented in low and high dimension showing the effectiveness of the proposed method and its robustness with respect to inexact data evaluations, provided by the gradient information. The resulting tensor train approximation paves the way towards fast synthesis of the control signal in real-time applications.

“…Analogously to [41] one can also incorporate control constraints in terms of projection operators. The generalization of the present approach to stochastic control problems and finite horizon problems is discussed in [21,53].…”

Section: Introduction

mentioning

confidence: 99%

Approximating the Stationary Bellman Equation by Hierarchical Tensor Products

Oster, Sallandt, Schneider

2024

JCM

We treat infinite horizon optimal control problems by solving the associated stationary Bellman equation numerically to compute the value function and an optimal feedback law. The dynamical systems under consideration are spatial discretizations of nonlinear parabolic partial differential equations (PDEs), which means that the Bellman equation suffers from the curse of dimensionality. Its nonlinearity is handled by the Policy Iteration algorithm, where the problem is reduced to a sequence of linear equations, which remain the computational bottleneck due to their high dimensions. We reformulate the linearized Bellman equations via the Koopman operator into an operator equation that is solved using a minimal residual method. Using the Koopman operator, we identify a preconditioner for the operator equation, which proves essential in our numerical tests. To overcome computational infeasibility, we use low-rank hierarchical tensor product approximation/tree-based tensor formats, in particular tensor trains (TT tensors) and multi-polynomials, together with high-dimensional quadrature, e.g. Monte Carlo. Numerical evidence is given by controlling a destabilized version of the viscous Burgers equation and a diffusion equation with an unstable reaction term.

“…8) see Fig. 3.8, which illustrates that PI can be viewed as Newton's method for solving the Bellman equation in the function space of cost functions J. The interpretation of PI as a form of Newton's method has a long history, for which we refer to the original papers by Kleinman [Klei68] for linear quadratic problems, and by Pollatschek and Avi-Itzhak [PoA69] for the finite-state discounted and Markov game cases. Subsequent works, which address broader classes of problems and algorithmic variations, include (among others) Hewer [Hew71], Puterman and Brumelle [PuB78], [PuB79], Saridis and Lee [SaL79] (following Rekasius [Rek64]), Beard [Bea95], Beard, Saridis, and Wen [BSW99], Santos and Rust [SaR04], Bokanowski, Maroso, and Zidani [BMZ09], Hylla [Hyl11], Magirou, Vassalos, and Barakitis [MVB20], Bertsekas [Ber21c], and Kundu and Kunisch [KuK21]. Some of these papers include superlinear convergence rate results. Rollout: Generally, rollout with base policy µ can be viewed as a single iteration of Newton's method starting from J_µ, as applied to the solution of the Bellman equation (see Fig. 3.8).…”

mentioning

confidence: 99%

Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control

Bertsekas

2021

Preprint

Some of the most exciting success stories in reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon). These programs were trained off-line extensively using sophisticated approximate policy iteration algorithms and neural networks. Yet the AlphaZero player that has been obtained off-line is not used directly during on-line play. Instead a separate on-line player is used, which is based on multistep lookahead and a terminal cost that was trained using experience with the off-line player. The on-line player has greatly improved performance. Similarly, TD-Gammon computed off-line a terminal cost function approximation, which was used to extend its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the terminal cost function approximation). In particular:

(a) The on-line player of AlphaZero plays much better than its extensively trained off-line player. This is due to the beneficial effect of approximation in value space with long lookahead minimization, which corrects for the inevitable imperfections of the off-line player, and its terminal cost approximation.

(b) The TD-Gammon player that uses long rollout plays much better than TD-Gammon with one-step or two-step lookahead without rollout. This is due to the beneficial effect of the rollout, which serves as a substitute for long lookahead minimization.

An important lesson from AlphaZero and TD-Gammon is that performance of an off-line trained controller can be greatly improved by on-line approximation in value space, with long lookahead (whether involving minimization or rollout with an off-line obtained policy), and terminal cost approximation that is obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is the focal point of this paper: approximation in value space amounts to a step of Newton's method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training and may be enhanced by longer lookahead and on-line rollout. This process can be understood in terms of abstract models of infinite horizon dynamic programming and simple geometrical constructions. It manifests itself to some extent in model predictive control, but it seems that it has yet to be fully appreciated within …

† Early draft of a research monograph to be published by Athena Scientific, Belmont, MA, sometime in 2022. The complete monograph will include expanded versions of Sections 4 and 5, and a treatment of finite horizon and discrete optimization problems.
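The citation statement and abstract above describe policy iteration (PI) as Newton's method for the Bellman equation, with Kleinman's linear quadratic case as the original instance. The following is a minimal sketch of that idea for a scalar, unconstrained linear quadratic problem only. The function names (`kleinman_scalar`, `riccati_exact`, `newton_step`) and the numerical values are illustrative assumptions and are not taken from any of the cited papers; in particular, the sketch does not include the control constraints treated by Kundu and Kunisch.

```python
import math


def kleinman_scalar(a, b, q, r, k0, iters=8):
    """Policy iteration for the scalar LQ problem (illustrative sketch).

    Dynamics: dx/dt = a*x + b*u, cost: integral of q*x^2 + r*u^2,
    feedback policy: u = -k*x. Returns the final gain and the list of
    cost coefficients p, where the cost of a policy is p*x^2.
    """
    k = k0
    history = []
    for _ in range(iters):
        closed_loop = a - b * k
        if closed_loop >= 0:
            raise ValueError("gain must stabilize the loop (a - b*k < 0)")
        # Policy evaluation: solve the scalar Lyapunov equation
        # 2*(a - b*k)*p + q + r*k^2 = 0 for p.
        p = (q + r * k * k) / (-2.0 * closed_loop)
        history.append(p)
        # Policy improvement: the minimizer of the Hamiltonian is u = -(b*p/r)*x.
        k = b * p / r
    return k, history


def riccati_exact(a, b, q, r):
    """Positive root of the scalar Riccati equation 2*a*p - (b^2/r)*p^2 + q = 0."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)


def newton_step(p, a, b, q, r):
    """One Newton step on F(p) = 2*a*p - (b^2/r)*p^2 + q."""
    residual = 2 * a * p - b * b * p * p / r + q
    slope = 2 * a - 2 * b * b * p / r
    return p - residual / slope


if __name__ == "__main__":
    # Hypothetical data: an unstable plant (a > 0) and a stabilizing initial gain.
    gain, costs = kleinman_scalar(a=1.0, b=1.0, q=1.0, r=1.0, k0=3.0)
    print("cost coefficients per iteration:", costs)
    print("Riccati solution:", riccati_exact(1.0, 1.0, 1.0, 1.0))
```

In this scalar setting each PI sweep (policy evaluation followed by policy improvement) produces the same value as one Newton step on the Riccati equation, which is why the iterates converge quadratically once a stabilizing gain is supplied.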
