2019
DOI: 10.48550/arxiv.1906.10306
Preprint

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

Abstract: Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our…
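
For readers less familiar with the algorithms named in the abstract, the sketch below shows the standard clipped surrogate objective of PPO (Schulman et al., 2017) in plain NumPy. It is a generic illustration, not the specific neural PPO/TRPO variant analyzed in the paper; the function and argument names are illustrative only.

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized over the new policy)."""
    ratio = np.exp(log_prob_new - log_prob_old)                     # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))                  # pessimistic (clipped) bound
```

In the setting studied by the paper, the log-probabilities would come from a policy parametrized by an overparametrized neural network and the advantage estimates from a neural critic.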

Cited by 33 publications (55 citation statements)
References 21 publications
“…Recently, policy optimization (PO) has seen great success in many real-world applications, especially when coupled with deep neural networks (Silver et al., 2017; Duan et al., 2016; Wang et al., 2018), and a variety of PO-based algorithms have been proposed (Williams, 1992; Kakade, 2001; Schulman et al., 2015, 2017; Konda and Tsitsiklis, 2000). The theoretical understanding of PO has also been studied from both a computational (i.e., convergence) perspective (Liu et al., 2019; Wang et al., 2019a) and a statistical (i.e., regret) perspective (Efroni et al., 2020a). Thus, one fundamental question to ask is how to build on existing understandings of non-private PO algorithms to design sample-efficient policy-based RL algorithms with general privacy guarantees (e.g., JDP and LDP), which is the main motivation behind this work.…”
Section: Algorithm
confidence: 99%
“…The concentrability coefficient commonly appears in the reinforcement learning literature (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016; Liu et al., 2019; Wang et al., 2019). In contrast to the more standard forms of the concentrability coefficient, note that κ is independent of the updates of the algorithm.…”
Section: Convergence of Mean-Field PPO
confidence: 99%
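
For context, one form of concentrability coefficient that is common in this literature (the notation below is illustrative and need not match the κ of the quoted paper) measures the second moment of the density ratio between the comparator policy's state-action visitation measure d_{π*} and the sampling distribution σ used by the algorithm:

```latex
\kappa \;=\; \Bigl\{\, \mathbb{E}_{(s,a)\sim\sigma}\Bigl[\Bigl(\frac{\mathrm{d}\, d_{\pi^*}}{\mathrm{d}\sigma}(s,a)\Bigr)^{2}\Bigr] \Bigr\}^{1/2}
```

A coefficient of this type depends only on the fixed distributions d_{π*} and σ, which is one way to read the remark that κ is independent of the algorithm's updates.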
“…The classical theory of AC focuses on the case of linear function approximation, where the actor and critic are represented using linear functions with the feature mapping fixed throughout learning (Bhatnagar et al., 2008, 2009; Konda and Tsitsiklis, 2000). Meanwhile, a few recent works establish convergence and optimality of AC with overparameterized neural networks (Fu et al., 2020; Liu et al., 2019; Wang et al., 2019), where the neural network training is captured by the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Specifically, with properly designed parameter initialization and stepsizes, and sufficiently large network widths, the neural networks employed by both the actor and the critic can be assumed to be well approximated by linear functions of a random feature vector.…”
Section: Introduction
confidence: 99%
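
To make the NTK-style linearization described in the excerpt above concrete, here is a minimal NumPy sketch (all names are illustrative assumptions) for a width-m two-layer ReLU network f(x; W) = (1/√m) Σ_r b_r ReLU(W_r · x): near a random initialization W0, with the output signs b fixed, the network is well approximated by a function that is linear in W, i.e. linear in the random feature vector given by the gradient at initialization.

```python
import numpy as np

def init_two_layer(d, m, rng):
    """Random initialization of a width-m two-layer ReLU network on R^d."""
    W0 = rng.normal(size=(m, d)) / np.sqrt(d)    # first-layer weights at initialization
    b = rng.choice([-1.0, 1.0], size=m)          # second-layer signs, kept fixed
    return W0, b

def two_layer(x, W, b):
    """f(x; W) = (1/sqrt(m)) * sum_r b_r * ReLU(W_r . x)."""
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(len(b))

def ntk_features(x, W0, b):
    """Gradient of f(x; W) w.r.t. W at W = W0: the 'random feature' of the NTK regime."""
    act = (W0 @ x > 0).astype(float)             # ReLU activation pattern at initialization
    return (b * act)[:, None] * x[None, :] / np.sqrt(len(b))

def linearized(x, W, W0, b):
    """First-order expansion: f(x; W) ~ f(x; W0) + <grad_W f(x; W0), W - W0>."""
    return two_layer(x, W0, b) + np.sum(ntk_features(x, W0, b) * (W - W0))

# For large m and W close to W0, the linearized value tracks the true network output closely.
rng = np.random.default_rng(0)
W0, b = init_two_layer(d=5, m=10000, rng=rng)
W = W0 + 1e-3 * rng.normal(size=W0.shape) / np.sqrt(W0.shape[0])
x = rng.normal(size=5)
print(two_layer(x, W, b), linearized(x, W, W0, b))
```

In the analyses cited in the excerpt, both the actor and the critic are treated in this way, so their training approximately reduces to learning linear coefficients over such random features.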
“…Recently, policy optimization (PO) has seen great success in many real-world applications, especially when coupled with function approximation (Silver et al., 2017; Duan et al., 2016; Wang et al., 2018), and a variety of PO-based algorithms have been proposed (Williams, 1992; Kakade, 2001; Schulman et al., 2015, 2017; Konda and Tsitsiklis, 2000). The theoretical understanding of PO has also been studied from both a computational (i.e., convergence) perspective (Liu et al., 2019) and a statistical (i.e., regret) perspective (Efroni et al., 2020). Unfortunately, all of these algorithms are non-private, and thus applying them directly to the personalized services above may lead to privacy concerns.…”
Section: Introduction
confidence: 99%