Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, 2012
DOI: 10.1002/9781118453988.ch17

Lambda‐Policy Iteration: A Review and a New Implementation

Abstract: In this paper we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various …
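To make the VI–PI relationship described in the abstract concrete, here is a minimal tabular sketch of λ-policy iteration. It is not taken from the chapter; the inputs (transition array P of shape (A, S, S), stage-cost array g of shape (A, S), discount factor alpha) and the closed-form evaluation step are assumptions made for illustration. Setting lam = 0 recovers value iteration, while lam close to 1 approaches exact policy evaluation, i.e., policy iteration.

import numpy as np

def lambda_policy_iteration(P, g, alpha=0.95, lam=0.5, num_iters=100):
    # P: (A, S, S) transition probabilities, g: (A, S) stage costs -- hypothetical inputs.
    A, S, _ = P.shape
    J = np.zeros(S)                                  # initial cost guess
    mu = np.zeros(S, dtype=int)
    for _ in range(num_iters):
        # Policy improvement: greedy (cost-minimizing) policy for the current J.
        Q = g + alpha * (P @ J)                      # Q[a, s]
        mu = np.argmin(Q, axis=0)
        P_mu = P[mu, np.arange(S), :]                # transitions under mu
        g_mu = g[mu, np.arange(S)]                   # stage costs under mu
        # Lambda-policy evaluation, J <- T_mu^{(lambda)} J, via the closed form
        # J = (I - lam*alpha*P_mu)^{-1} (g_mu + (1 - lam)*alpha*P_mu J).
        J = np.linalg.solve(np.eye(S) - lam * alpha * P_mu,
                            g_mu + (1 - lam) * alpha * (P_mu @ J))
    return J, mu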

Cited by 16 publications (14 citation statements) | References 47 publications
“…Corollary 1: Let $\{Q_i(z,a)\}$ be the sequence generated by (10) and (11), and $\{Q_i^{VI}(z,a)\}$ the sequence generated by the standard VI algorithm corresponding to setting $H_i = 1$ for all $i$ in (11). Under assumption (12)…”
Section: Iterating Leads To
confidence: 99%
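The role of $H_i = 1$ in this corollary is easiest to see under an optimistic-PI-style reading of the citing paper's scheme; since equations (10)–(12) are not reproduced in the snippet, the sketch below is only an illustration with hypothetical inputs (transition array P, stage-cost array g, discount alpha, sweep counts H). Each outer step takes the greedy policy from the current Q-factors and performs H_i evaluation backups under that fixed policy; with H_i = 1 for all i, every outer step collapses to one standard Q-value-iteration backup.

import numpy as np

def multi_step_q_iteration(P, g, alpha, H):
    # P: (A, S, S) transitions, g: (A, S) stage costs, H: iterable of sweep counts H_i.
    A, S, _ = P.shape
    Q = np.zeros((A, S))
    for H_i in H:
        mu = np.argmin(Q, axis=0)            # greedy policy from the current Q-factors
        for _ in range(H_i):                 # H_i evaluation backups under the fixed policy mu
            Q_mu = Q[mu, np.arange(S)]       # value of the greedy action in each state
            Q = g + alpha * (P @ Q_mu)       # one Bellman backup for every (a, s) pair
    return Q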
“…presented in [11]. A class of PI algorithms based on temporal difference learning and the λ-operator is proposed in [12], which has been further extended using abstract dynamic programming [13] and randomized proximal methods [14], [15]. An alternative family of model-based tabular PI algorithms with multi-step greedy policy improvement is derived in [16], [17].…”
Section: Introduction
confidence: 99%
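For context, the λ-operator referred to in [12] (and studied in the present chapter) is the standard multistep mapping built from the Bellman operator $T$, reproduced here for convenience:

\[
T^{(\lambda)} \;=\; (1-\lambda)\sum_{\ell=0}^{\infty} \lambda^{\ell}\, T^{\ell+1},
\qquad 0 \le \lambda < 1,
\]

so $\lambda = 0$ gives a single value-iteration step $T$, while values of $\lambda$ close to 1 weight longer lookahead and, when applied with a fixed policy's operator $T_\mu$, approach exact policy evaluation.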
“…The idea is to utilize the potential of ADP/RL in handling stochastic processes [45], [53], [54], if the probability distribution functions of the delays and losses are known, by using expected value operators in the Bellman equation. While the details are skipped due to page constraints, interested readers are referred to the available studies both for conventional systems [55], [56] and NCSs [34].…”
Section: Extension To NCS With Random Delay And Packet Loss
confidence: 99%
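Schematically, "using expected value operators in the Bellman equation" amounts to taking the expectation over the random delay $\tau$ and the packet-loss indicator $\gamma$, whose distributions are assumed known. The equation below is a generic sketch of that idea, not an equation taken from the cited works, and the transition map $f$ is a hypothetical placeholder for the networked-control-system dynamics:

\[
V(x) \;=\; \min_{u}\; \mathbb{E}_{\tau,\gamma}\!\big[\, g(x,u) + \alpha\, V\big(f(x,u,\tau,\gamma)\big) \big].
\]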
“…Therefore, instead of a recursion, one ends up with an equation to solve for the unknown value function. Motivated by the Value Iteration (VI) scheme in ADP/RL for solving conventional problems [45], [56], starting with a guess on $V_0(\cdot)$, for example $V_0(\cdot)$ …”
Section: Extension To Infinite-Horizon Problems
confidence: 99%
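The replacement of a finite-horizon recursion by a fixed-point equation, solved by value iteration from an initial guess, can be written generically as follows (standard VI notation, not the cited papers' exact formulation):

\[
V_{k+1}(x) \;=\; \min_{u}\; \mathbb{E}\big[\, g(x,u) + \alpha\, V_k\big(f(x,u,w)\big) \big],
\qquad k = 0, 1, \ldots,
\]

starting from a guess $V_0(\cdot)$, e.g., $V_0 \equiv 0$, with $V_k$ converging to the solution of the fixed-point equation $V = TV$ under the usual contraction (discounting) assumptions.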
“…which aims to converge to a fixed point of $\Pi P^{(c)}$. The algorithm may be based on simulation-based computations of $\Pi T^{(\lambda)} x$, and such computations have been discussed in the approximate DP context as part of the LSPE(λ) method (noted earlier), and the λ-policy iteration method (proposed in [BeI96], and further developed in the book [BeT96], and the papers [Ber12b] and [Sch13]). The simulation-based methods for computing $\Pi T^{(\lambda)} x$ have been adapted to the more general linear equation context in [BeY07], [BeY09]; see also [Ber12a], Section 7.3.…”
Section: Introduction
confidence: 99%
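For orientation, the simulation-based computations mentioned here target the projected multistep equation (standard notation from the projected-equation literature, added for reference rather than quoted from the paper):

\[
x \;=\; \Pi T^{(\lambda)} x,
\qquad
T^{(\lambda)} \;=\; (1-\lambda)\sum_{\ell=0}^{\infty} \lambda^{\ell}\, T^{\ell+1},
\]

where, in the general linear-equation setting, $T$ is an affine mapping $Tx = Ax + b$ and $\Pi$ is a weighted Euclidean projection onto an approximation subspace; LSPE(λ) and λ-policy iteration correspond to particular simulation-based ways of evaluating or iterating with $\Pi T^{(\lambda)}$.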