2021
DOI: 10.1137/20m1382386

Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a Finite Horizon

Abstract: We explore reinforcement learning methods for finding the optimal policy in the linear quadratic regulator (LQR) problem. In particular, we consider the convergence of policy gradient methods in the setting of known and unknown parameters. We are able to produce a global linear convergence guarantee for this approach in the setting of finite time horizon and stochastic state dynamics under weak assumptions. The convergence of a projected policy gradient method is also established in order to handle problems wit…
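The setting in the abstract can be made concrete with a short sketch: gradient descent on the exact expected cost of a finite-horizon LQR with additive noise, under time-varying linear feedback u_t = -K_t x_t. This is only an illustration of the problem class; the system matrices, step size, and the plain finite-difference gradient below are assumptions of this sketch, not the paper's algorithm or constants.

```python
# Minimal sketch (illustrative; not the paper's method or constants) of
# policy gradient on a finite-horizon noisy LQR with u_t = -K_t x_t.
import numpy as np

n, m, T = 2, 1, 10                       # state dim, input dim, horizon
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # hypothetical system matrices
B = np.array([[0.0], [0.1]])
Q, R = np.eye(n), 0.1 * np.eye(m)        # stage costs x'Qx + u'Ru, terminal x'Qx
W = 0.01 * np.eye(n)                     # covariance of the additive noise w_t
Sigma0 = np.eye(n)                       # covariance of the random initial state

def expected_cost(K):
    """Exact expected cost of u_t = -K[t] x_t via the backward recursion
    P_T = Q,  P_t = Q + K_t' R K_t + (A - B K_t)' P_{t+1} (A - B K_t),
    so that E[cost] = tr(P_0 Sigma0) + sum_t tr(P_{t+1} W)."""
    P, noise = Q.copy(), 0.0
    for t in reversed(range(T)):
        noise += np.trace(P @ W)         # w_t enters the cost through P_{t+1}
        Acl = A - B @ K[t]
        P = Q + K[t].T @ R @ K[t] + Acl.T @ P @ Acl
    return np.trace(P @ Sigma0) + noise

def grad(K, eps=1e-6):
    """Finite-difference gradient of expected_cost w.r.t. each gain K_t."""
    base, G = expected_cost(K), [np.zeros((m, n)) for _ in range(T)]
    for t in range(T):
        for i in range(m):
            for j in range(n):
                Kp = [Kt.copy() for Kt in K]
                Kp[t][i, j] += eps
                G[t][i, j] = (expected_cost(Kp) - base) / eps
    return G

K = [np.zeros((m, n)) for _ in range(T)]  # start from the zero policy
lr = 0.05
for it in range(201):
    G = grad(K)
    K = [Kt - lr * Gt for Kt, Gt in zip(K, G)]
    if it % 50 == 0:
        print(f"iter {it:3d}   expected cost {expected_cost(K):.4f}")
```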

Cited by 35 publications (51 citation statements)
References 25 publications
“…Subsequently, Malik et al [20] improve the sample efficiency but their result holds only with a fixed probability and thus does not seem applicable for our purposes. Hambly et al [13] also improve the sample efficiency, but in a finite horizon setting. Mohammadi et al [22] give sample complexity bounds for the continuous-time variant of LQR.…”
Section: Related Work
confidence: 99%
“…Despite the notable success of PGMs, a mathematical theory that guarantees the convergence of these algorithms for general (continuous time) stochastic control problems has been elusive. Analysing the convergence behavior of PGMs is technically challenging, as the objective of a control problem is typically nonconvex with respect to the policies, even in the LQ setting [10,13]. Most existing theoretical results of PGMs, especially those establishing (optimal) linear convergence, focus on discrete time problems and restrict policies within specific parametric families.…”
Section: Introduction
confidence: 99%
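The nonconvexity noted in the statement above can be checked directly on a toy instance. The numbers below are illustrative and not taken from [10] or [13]; they show that even a two-step scalar LQR cost violates the midpoint inequality satisfied by every convex function.

```python
# Self-contained check (illustrative numbers) that the finite-horizon LQR cost
# is nonconvex in the feedback gains: the midpoint policy costs more than the
# average cost of two policies.
# Scalar system x_{t+1} = a*x_t + b*u_t, horizon 2, u_t = -k_t*x_t, x_0 = 1.
a, b, q, r = 1.0, 1.0, 1.0, 0.01

def cost(k0, k1):
    inner = q + r * k1**2 + q * (a - b * k1) ** 2     # cost from time 1 on, per unit x_1^2
    return q + r * k0**2 + (a - b * k0) ** 2 * inner  # stage-0 cost + propagated cost-to-go

K1, K2, mid = (-0.5, 0.5), (0.5, -0.5), (0.0, 0.0)
print(cost(*K1), cost(*K2), cost(*mid))
# prints 3.820625, 1.815625, 3.0: cost(mid) = 3.0 > 2.818125, the average of the
# other two, so the cost cannot be convex in (k0, k1).
```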
“…Most existing theoretical results of PGMs, especially those establishing (optimal) linear convergence, focus on discrete time problems and restrict policies within specific parametric families. This includes Markov decision problems (MDPs) with softmax parameterized policies [26] or overparametrized one-hidden-layer neural-network policies [38,11,19], and discrete time linear (LQ) control problems with linear parameterized policies [10,13]. The analysis therein exploits heavily the specific structure of the considered (discrete time) control problems and policy parameterizations, and hence is difficult to extend to general continuous time control problems or general policy parameterizations.…”
Section: Introduction
confidence: 99%
“…For policy gradient methods, a sample complexity of O(ε⁻²) is known; see e.g., [30] for a modern analysis, and see also [28] for a speedup. If more structure about the problem is known and one considers an exact version, one can sometimes obtain a geometric rate; for example, [11] gives a geometric rate for a version of policy gradient when applied to an LQR problem. Likewise, information about the structure of the optimal policy in general was shown to be useful in [21].…”
confidence: 99%
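The sample-complexity results mentioned in this statement rest on derivative-free gradient estimates built from cost evaluations alone. The sketch below shows a standard two-point sphere-smoothing estimator on a toy two-step scalar LQR cost; the estimator, cost, and constants are assumptions of this illustration, not the schemes of [30], [28], or [11].

```python
# Sketch of a two-point zeroth-order gradient estimate of the kind behind
# O(eps^-2)-style sample complexity bounds (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
a, b, q, r = 1.0, 1.0, 1.0, 0.1           # hypothetical scalar LQR data, horizon 2

def cost(K):
    k0, k1 = K
    inner = q + r * k1**2 + q * (a - b * k1) ** 2
    return q + r * k0**2 + (a - b * k0) ** 2 * inner

def two_point_grad(K, radius=0.05, samples=20):
    """Average of d/(2*radius) * (cost(K + radius*u) - cost(K - radius*u)) * u
    over random unit directions u (standard sphere-smoothing estimator)."""
    d, g = len(K), np.zeros(len(K))
    for _ in range(samples):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        g += d / (2 * radius) * (cost(K + radius * u) - cost(K - radius * u)) * u
    return g / samples

K = np.zeros(2)                            # start from the zero policy
for it in range(400):
    K -= 0.05 * two_point_grad(K)          # gradient step with estimated gradient
print("final gains:", K, " cost:", cost(K))
```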