2018
DOI: 10.1007/s10589-018-9990-5

Proximal algorithms and temporal difference methods for solving fixed point problems


Cited by 6 publications (6 citation statements); citing publications span 2019–2024. References 56 publications.
“…The choice of λ embodies the important bias-variance tradeoff: larger values of λ lead to better approximation of J_µ, but require a larger number of simulation samples because of increased simulation noise (see the discussion in Section 6.3.6 of [Ber12]). An important insight is that the operator T^{(λ)}_µ is closely related to the proximal operator of convex analysis (with λ corresponding to the penalty parameter of the proximal operator), as shown in the author's paper [Ber16a] (see also the monograph [Ber18a], Section 1.2.5, and the paper [Ber18b]). In particular, TD(λ) can be viewed as a stochastic simulation-based version of the proximal algorithm.…”
Section: Indirect Methods Based On Projected Equations
Citation type: mentioning, confidence: 99%
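One way to make this relation concrete is the extrapolation identity for the linear case: for a fixed point problem x = Ax + b with contractive A, penalty parameter c > 0, and λ = c/(c+1), the multistep mapping T^{(λ)} coincides with an extrapolated proximal step. The snippet below is a minimal numerical sketch of that identity (the matrices, vectors, and parameters are synthetic, chosen only for illustration):

```python
# Minimal sketch (illustrative, not code from the paper): for x = A x + b with
# contractive A, penalty c > 0 and lam = c / (c + 1), the proximal step
#   P_c(x) = (I + c (I - A))^{-1} (x + c b)
# and the multistep mapping
#   T_lam(x) = (1 - lam) * sum_{l >= 0} lam^l * T^{l+1}(x),   T(x) = A x + b,
# satisfy the extrapolation formula T_lam(x) = x + (P_c(x) - x) / lam.
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A *= 0.9 / np.linalg.norm(A, 2)       # scale so that T is a contraction
b = rng.standard_normal(n)
x = rng.standard_normal(n)

c = 3.0
lam = c / (c + 1.0)
I = np.eye(n)

# Proximal (resolvent) step for the equation (I - A) x = b
P_c = np.linalg.solve(I + c * (I - A), x + c * b)

# Multistep mapping T^(lambda) applied to x, truncated at many terms
y, T_lam = x.copy(), np.zeros(n)
for l in range(500):
    y = A @ y + b                      # y = T^{l+1}(x)
    T_lam += (1 - lam) * lam**l * y

print(np.allclose(T_lam, x + (P_c - x) / lam))   # True up to truncation error
```

As λ approaches 1 the multistep/proximal step moves essentially all the way to the fixed point in a single application, which is the deterministic counterpart of the bias-variance tradeoff described in the quotation.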
“…is a projected equation, which is related to the proximal algorithm [Ber16a], [Ber18b], and may be solved by using temporal differences. Thus we may use exploration-enhanced versions of the LSTD(λ) and LSPE(λ) methods in an approximate PI scheme to solve the λ-aggregation equation.…”
Section: λ-Aggregation
Citation type: mentioning, confidence: 99%
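For context, the core recursion of plain LSTD(λ) with linear features, which such a scheme builds on, can be sketched as below (a textbook-style illustration, not the exploration-enhanced variant referred to above; the feature map, discount factor, and sampled trajectory are placeholders):

```python
# Hedged sketch of plain LSTD(lambda) for evaluating a fixed policy with linear
# features; exploration-enhanced variants change how the trajectory is generated,
# not this core recursion. All data below are illustrative.
import numpy as np

def lstd_lambda(trajectory, phi, gamma=0.95, lam=0.7):
    """trajectory: list of (state, cost, next_state) generated under the policy.
    phi: callable mapping a state to a feature vector of dimension d."""
    d = len(phi(trajectory[0][0]))
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)                               # eligibility trace
    for s, g, s_next in trajectory:
        z = gamma * lam * z + phi(s)
        A += np.outer(z, phi(s) - gamma * phi(s_next))
        b += z * g
    return np.linalg.solve(A, b)                  # r with J(s) ~ phi(s) . r

# Example on a 3-state cycle with one-hot features (synthetic data)
phi = lambda s: np.eye(3)[s]
traj = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 0)] * 200
print(lstd_lambda(traj, phi))
```

Solving the accumulated system A r = b once is what distinguishes LSTD(λ) from the incremental TD(λ) update; LSPE(λ) instead iterates using the same simulation-generated quantities.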
“…Except the cases in which U(x) is not a singleton for a finite number of x, which we refer to as trivial cases, M being finite implies that the state space is finite. Therefore, except in the trivial cases, with the following finite policy assumption, the λ-operator T^{(λ)}_µ is ensured to be well-posed (see (Bertsekas, 2018b, Proposition 2.1)), and the monotonicity of the underlying operator H is not required for the desired behavior.…”
Section: λ-PIR
Citation type: mentioning, confidence: 99%
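For reference, the λ-operator in question is the standard multistep mapping associated with a policy µ, stated here in the usual form (well-posedness amounts to convergence of this series for the cost functions considered):

```latex
% Standard multistep (lambda-) operator for a policy mu; it is well posed
% whenever the series converges for the functions J of interest.
\[
  T^{(\lambda)}_{\mu} J \;=\; (1-\lambda) \sum_{\ell=0}^{\infty} \lambda^{\ell}\, T_{\mu}^{\ell+1} J,
  \qquad 0 < \lambda < 1 .
\]
```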
“…A survey can be found in Bertsekas (2012). Most recently, the connection between TD(λ) and proximal algorithms, which are widely used for solving convex optimization problems, is discussed in Bertsekas (2018b). In light of this relation, λ-PI with randomization (λ-PIR) was proposed in (Bertsekas, 2018a, Chapter 2).…”
Section: Introduction
Citation type: mentioning, confidence: 99%
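A hedged sketch of the underlying λ-PI iteration is given below (plain λ-PI on a synthetic finite MDP, not the randomized λ-PIR variant cited above; the transition model, costs, and discount factor are illustrative):

```python
# Hedged sketch of lambda-policy iteration on a small random MDP (costs are
# minimized, Bertsekas-style). This is plain lambda-PI, not the randomized
# lambda-PIR variant; all model data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, lam = 4, 2, 0.9, 0.7
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)              # P[u, x, y] = Prob(y | x, u)
g = rng.random((nA, nS))                       # stage cost g(x, u)

def T_lam(J, mu):
    """Apply the multistep operator T^(lambda)_mu to J; for a finite MDP the
    mapping T_mu is affine, so the defining series can be summed in closed form."""
    P_mu = P[mu, np.arange(nS)]                # row x: P(. | x, mu(x))
    g_mu = g[mu, np.arange(nS)]
    M = np.linalg.inv(np.eye(nS) - lam * gamma * P_mu)
    return (1 - lam) * gamma * P_mu @ (M @ J) + M @ g_mu

J = np.zeros(nS)
for k in range(200):
    Q = g + gamma * P @ J                      # Q[u, x] = g(x, u) + gamma E[J]
    mu = Q.argmin(axis=0)                      # greedy (cost-minimizing) policy
    J = T_lam(J, mu)                           # single multistep evaluation

print(mu, J)
```

With λ = 0 the iteration reduces to value iteration, and as λ approaches 1 it approaches standard policy iteration, which is how the λ-PI family is usually positioned.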
“…presented in [11]. A class of PI algorithms based on temporal difference learning and the λ-operator is proposed in [12], which has been further extended using abstract dynamic programming [13] and randomized proximal methods [14], [15]. An alternative family of model-based tabular PI algorithms with multi-step greedy policy improvement is derived in [16], [17].…”
Section: Introduction
Citation type: mentioning, confidence: 99%