2019
DOI: 10.48550/arxiv.1906.01786
Preprint

Global Optimality Guarantees For Policy Gradient Methods

Abstract: Policy gradient methods are perhaps the most widely used class of reinforcement learning algorithms. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by classical techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to local minima. This work identifies structural properties -- shared by fin…
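For concreteness, the following is a minimal sketch of the kind of method the abstract describes: REINFORCE-style stochastic gradient ascent over a tabular softmax policy on a randomly generated finite MDP. The environment, horizon, and step size are illustrative placeholders, not taken from the paper.

```python
# Minimal REINFORCE-style policy gradient sketch on a toy finite MDP.
# The dynamics, rewards, and hyperparameters below are illustrative
# placeholders, not anything from the paper under discussion.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, horizon = 3, 2, 20
# Hypothetical dynamics: P[s, a] is a distribution over next states,
# R[s, a] is the immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

theta = np.zeros((n_states, n_actions))  # softmax policy parameters


def policy(s):
    """Softmax policy pi(.|s) parameterized by theta[s]."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()


def rollout():
    """Sample one trajectory and return a list of (state, action, reward)."""
    s, traj = 0, []
    for _ in range(horizon):
        probs = policy(s)
        a = rng.choice(n_actions, p=probs)
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])
    return traj


alpha = 0.05
for episode in range(2000):
    traj = rollout()
    rewards = [r for _, _, r in traj]
    grad = np.zeros_like(theta)
    for t, (s, a, _) in enumerate(traj):
        G = sum(rewards[t:])        # return-to-go from step t
        probs = policy(s)
        glog = -probs               # grad of log-softmax: indicator(a) - pi(.|s)
        glog[a] += 1.0
        grad[s] += glog * G
    theta += alpha * grad           # stochastic gradient ascent step
```

The objective being ascended here is non-convex in theta even for this tabular parameterization, which is the setting in which the paper asks when stationary points are nonetheless globally optimal.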

Cited by 67 publications (108 citation statements)
References 18 publications (25 reference statements)
“…For more examples of sample complexity analysis for convergence to a stationary point, see for example [162,225,226,183,222]. The global optimality of stationary points was studied in [23] where they identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex.…”
Section: Discussion
confidence: 99%
“…In this work we are interested in estimating ∇H(d_θ) because it is essential for estimating ∇ρ(θ) [cf. (9)]. It is important to note, however, that Theorem 2 and Corollary 1 are of independent interest.…”
Section: Entropy and OIR Policy Gradient Theorems
confidence: 99%
“…Fix a policy parameter iterate θ_t at timestep t. The gradient ∇ρ(θ_t) [cf. (9)] with respect to the policy parameters θ of the OIR ρ(θ) [cf. (6)] evaluated at θ = θ_t satisfies…”
Section: Entropy and OIR Policy Gradient Theorems
confidence: 99%
“…To facilitate the understanding of theoretical aspects of policy gradient methods, canonical control problems of linear time-invariant (LTI) systems have been commonly used as benchmarks [8]-[12]. In particular, the linear quadratic regulator (LQR), one of the most fundamental optimal control problems, has recently regained significant research interest [8]-[11].…”
Section: Introduction
confidence: 99%
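As a rough illustration of the LQR benchmark mentioned in the citation above, the sketch below runs model-based gradient descent on the LQR cost over a static feedback gain K (with u = -Kx), using the standard closed-form policy-gradient expression for discrete-time LQR. The system matrices, initial gain, and step size are made-up toy values, not taken from the cited works, and the step size is assumed small enough that the iterates stay stabilizing.

```python
# Illustrative sketch: model-based policy gradient descent on a toy LQR instance,
# treating the static feedback gain K (u = -K x) as the policy parameter.
# All matrices and hyperparameters are hypothetical examples.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1],
              [0.0, 0.9]])          # open-loop dynamics (stable toy system)
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.eye(1)
Sigma0 = np.eye(2)                  # covariance of the random initial state


def lqr_cost_and_grad(K):
    """Exact cost and policy gradient for u = -K x on the toy system."""
    Acl = A - B @ K
    # Value matrix P_K: solves Acl' P Acl - P + (Q + K' R K) = 0
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # State covariance Sigma_K: solves Acl Sigma Acl' - Sigma + Sigma0 = 0
    Sigma = solve_discrete_lyapunov(Acl, Sigma0)
    cost = np.trace(P @ Sigma0)
    grad = 2.0 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
    return cost, grad


K = np.zeros((1, 2))                # stabilizing initial gain for this toy system
eta = 0.01
for _ in range(500):
    cost, grad = lqr_cost_and_grad(K)
    K = K - eta * grad              # gradient descent on the nonconvex LQR cost
```

Although the cost is nonconvex in K, LQR is exactly the kind of benchmark where gradient descent over policy parameters can be analyzed and shown to reach the global optimum, which is why it recurs in this literature.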