Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, 2012
DOI: 10.1002/9781118453988.ch17

Lambda‐Policy Iteration: A Review and a New Implementation

Abstract: In this paper we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various …
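To make the VI–PI relationship described in the abstract concrete, here is a minimal tabular sketch of λ-policy iteration. It is not taken from the chapter; the inputs (transition array P of shape (A, S, S), stage-cost array g of shape (A, S), discount factor alpha) and the closed-form evaluation step are assumptions made for illustration. Setting lam = 0 recovers value iteration, while lam close to 1 approaches exact policy evaluation, i.e., policy iteration.

import numpy as np

def lambda_policy_iteration(P, g, alpha=0.95, lam=0.5, num_iters=100):
    # P: (A, S, S) transition probabilities, g: (A, S) stage costs -- hypothetical inputs.
    A, S, _ = P.shape
    J = np.zeros(S)                                  # initial cost guess
    mu = np.zeros(S, dtype=int)
    for _ in range(num_iters):
        # Policy improvement: greedy (cost-minimizing) policy for the current J.
        Q = g + alpha * (P @ J)                      # Q[a, s]
        mu = np.argmin(Q, axis=0)
        P_mu = P[mu, np.arange(S), :]                # transitions under mu
        g_mu = g[mu, np.arange(S)]                   # stage costs under mu
        # Lambda-policy evaluation, J <- T_mu^{(lambda)} J, via the closed form
        # J = (I - lam*alpha*P_mu)^{-1} (g_mu + (1 - lam)*alpha*P_mu J).
        J = np.linalg.solve(np.eye(S) - lam * alpha * P_mu,
                            g_mu + (1 - lam) * alpha * (P_mu @ J))
    return J, mu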

Cited by 16 publications (14 citation statements) | References 47 publications
“…Corollary 1: Let $\{Q_i(z,a)\}$ be the sequence generated by (10) and (11), and $\{Q_i^{VI}(z,a)\}$ the sequence generated by the standard VI algorithm corresponding to setting $H_i = 1$ for all $i$ in (11). Under assumption (12)…”
Section: Iterating Leads To
confidence: 99%
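The role of $H_i = 1$ in this corollary is easiest to see under an optimistic-PI-style reading of the citing paper's scheme; since equations (10)–(12) are not reproduced in the snippet, the sketch below is only an illustration with hypothetical inputs (transition array P, stage-cost array g, discount alpha, sweep counts H). Each outer step takes the greedy policy from the current Q-factors and performs H_i evaluation backups under that fixed policy; with H_i = 1 for all i, every outer step collapses to one standard Q-value-iteration backup.

import numpy as np

def multi_step_q_iteration(P, g, alpha, H):
    # P: (A, S, S) transitions, g: (A, S) stage costs, H: iterable of sweep counts H_i.
    A, S, _ = P.shape
    Q = np.zeros((A, S))
    for H_i in H:
        mu = np.argmin(Q, axis=0)            # greedy policy from the current Q-factors
        for _ in range(H_i):                 # H_i evaluation backups under the fixed policy mu
            Q_mu = Q[mu, np.arange(S)]       # value of the greedy action in each state
            Q = g + alpha * (P @ Q_mu)       # one Bellman backup for every (a, s) pair
    return Q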
“…presented in [11]. A class of PI algorithms based on temporal difference learning and the λ-operator is proposed in [12], which has been further extended using abstract dynamic programming [13] and randomized proximal methods [14], [15]. An alternative family of model-based tabular PI algorithms with multi-step greedy policy improvement is derived in [16], [17].…”
Section: Introduction
confidence: 99%
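For context, the λ-operator referred to in [12] (and studied in the present chapter) is the standard multistep mapping built from the Bellman operator $T$, reproduced here for convenience:

\[
T^{(\lambda)} \;=\; (1-\lambda)\sum_{\ell=0}^{\infty} \lambda^{\ell}\, T^{\ell+1},
\qquad 0 \le \lambda < 1,
\]

so $\lambda = 0$ gives a single value-iteration step $T$, while values of $\lambda$ close to 1 weight longer lookahead and, when applied with a fixed policy's operator $T_\mu$, approach exact policy evaluation.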
“…The idea is to utilize the potential of ADP/RL in handling stochastic processes [45], [53], [54], if the probability distribution functions of the delays and losses are known, by using expected value operators in the Bellman equation. While the details are skipped due to page constraints, interested readers are referred to the available studies both for conventional systems [55], [56] and NCSs [34].…”
Section: Extension To NCS With Random Delay And Packet Loss
confidence: 99%
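Schematically, "using expected value operators in the Bellman equation" amounts to taking the expectation over the random delay $\tau$ and the packet-loss indicator $\gamma$, whose distributions are assumed known. The equation below is a generic sketch of that idea, not an equation taken from the cited works, and the transition map $f$ is a hypothetical placeholder for the networked-control-system dynamics:

\[
V(x) \;=\; \min_{u}\; \mathbb{E}_{\tau,\gamma}\!\big[\, g(x,u) + \alpha\, V\big(f(x,u,\tau,\gamma)\big) \big].
\]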
“…Therefore, instead of a recursion, one ends up with an equation to solve for the unknown value function. Motivated by the Value Iteration (VI) scheme in ADP/RL for solving conventional problems [45], [56], starting with a guess on $V_0(\cdot)$, for example $V_0(\cdot)$ …”
Section: Extension To Infinite-Horizon Problems
confidence: 99%
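The replacement of a finite-horizon recursion by a fixed-point equation, solved by value iteration from an initial guess, can be written generically as follows (standard VI notation, not the cited papers' exact formulation):

\[
V_{k+1}(x) \;=\; \min_{u}\; \mathbb{E}\big[\, g(x,u) + \alpha\, V_k\big(f(x,u,w)\big) \big],
\qquad k = 0, 1, \ldots,
\]

starting from a guess $V_0(\cdot)$, e.g., $V_0 \equiv 0$, with $V_k$ converging to the solution of the fixed-point equation $V = TV$ under the usual contraction (discounting) assumptions.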
“…which aims to converge to a fixed point of $\Pi P^{(c)}$. The algorithm may be based on simulation-based computations of $\Pi T^{(\lambda)} x$, and such computations have been discussed in the approximate DP context as part of the LSPE(λ) method (noted earlier), and the λ-policy iteration method (proposed in [BeI96], and further developed in the book [BeT96], and the papers [Ber12b] and [Sch13]). The simulation-based methods for computing $\Pi T^{(\lambda)} x$ have been adapted to the more general linear equation context in [BeY07], [BeY09]; see also [Ber12a], Section 7.3.…”
Section: Introduction
confidence: 99%
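For orientation, the simulation-based computations mentioned here target the projected multistep equation (standard notation from the projected-equation literature, added for reference rather than quoted from the paper):

\[
x \;=\; \Pi T^{(\lambda)} x,
\qquad
T^{(\lambda)} \;=\; (1-\lambda)\sum_{\ell=0}^{\infty} \lambda^{\ell}\, T^{\ell+1},
\]

where, in the general linear-equation setting, $T$ is an affine mapping $Tx = Ax + b$ and $\Pi$ is a weighted Euclidean projection onto an approximation subspace; LSPE(λ) and λ-policy iteration correspond to particular simulation-based ways of evaluating or iterating with $\Pi T^{(\lambda)}$.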