2019
DOI: 10.48550/arxiv.1906.10306
Preprint

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

Abstract: Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our…
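
For readers less familiar with the algorithms named in the abstract, the sketch below shows the standard clipped surrogate objective of PPO (Schulman et al., 2017) in plain NumPy. It is a generic illustration, not the specific neural PPO/TRPO variant analyzed in the paper; the function and argument names are illustrative only.

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized over the new policy)."""
    ratio = np.exp(log_prob_new - log_prob_old)                     # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))                  # pessimistic (clipped) bound
```

In the setting studied by the paper, the log-probabilities would come from a policy parametrized by an overparametrized neural network and the advantage estimates from a neural critic.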

Cited by 33 publications (55 citation statements)
References 21 publications
“…Recently, policy optimization (PO) has seen great success in many real-world applications, especially when coupled with deep neural networks (Silver et al., 2017; Duan et al., 2016; Wang et al., 2018), and a variety of PO-based algorithms have been proposed (Williams, 1992; Kakade, 2001; Schulman et al., 2015, 2017; Konda and Tsitsiklis, 2000). The theoretical understanding of PO has also been studied from both a computational (i.e., convergence) perspective (Liu et al., 2019; Wang et al., 2019a) and a statistical (i.e., regret) perspective (Efroni et al., 2020a). Thus, one fundamental question to ask is how to build on existing understandings of non-private PO algorithms to design sample-efficient policy-based RL algorithms with general privacy guarantees (e.g., JDP and LDP), which is the main motivation behind this work.…”
Section: Algorithm
confidence: 99%
“…The concentrability coefficient commonly appears in the reinforcement learning literature (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016; Liu et al., 2019; Wang et al., 2019). In contrast to the more standard forms of the concentrability coefficient, note that κ is independent of the updates of the algorithm.…”
Section: Convergence of Mean-Field PPO
confidence: 99%
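
For context, one form of concentrability coefficient that is common in this literature (the notation below is illustrative and need not match the κ of the quoted paper) measures the second moment of the density ratio between the comparator policy's state-action visitation measure d_{π*} and the sampling distribution σ used by the algorithm:

```latex
\kappa \;=\; \Bigl\{\, \mathbb{E}_{(s,a)\sim\sigma}\Bigl[\Bigl(\frac{\mathrm{d}\, d_{\pi^*}}{\mathrm{d}\sigma}(s,a)\Bigr)^{2}\Bigr] \Bigr\}^{1/2}
```

A coefficient of this type depends only on the fixed distributions d_{π*} and σ, which is one way to read the remark that κ is independent of the algorithm's updates.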
“…The classical theory of AC focuses on the case of linear function approximation, where the actor and critic are represented using linear functions with the feature mapping fixed throughout learning (Bhatnagar et al., 2008, 2009; Konda and Tsitsiklis, 2000). Meanwhile, a few recent works establish convergence and optimality of AC with overparameterized neural networks (Fu et al., 2020; Liu et al., 2019; Wang et al., 2019), where the neural network training is captured by the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Specifically, with properly designed parameter initialization and stepsizes, and sufficiently large network widths, the neural networks employed by both the actor and the critic can be assumed to be well approximated by linear functions of a random feature vector.…”
Section: Introduction
confidence: 99%
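
To make the NTK-style linearization described in the excerpt above concrete, here is a minimal NumPy sketch (all names are illustrative assumptions) for a width-m two-layer ReLU network f(x; W) = (1/√m) Σ_r b_r ReLU(W_r · x): near a random initialization W0, with the output signs b fixed, the network is well approximated by a function that is linear in W, i.e. linear in the random feature vector given by the gradient at initialization.

```python
import numpy as np

def init_two_layer(d, m, rng):
    """Random initialization of a width-m two-layer ReLU network on R^d."""
    W0 = rng.normal(size=(m, d)) / np.sqrt(d)    # first-layer weights at initialization
    b = rng.choice([-1.0, 1.0], size=m)          # second-layer signs, kept fixed
    return W0, b

def two_layer(x, W, b):
    """f(x; W) = (1/sqrt(m)) * sum_r b_r * ReLU(W_r . x)."""
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(len(b))

def ntk_features(x, W0, b):
    """Gradient of f(x; W) w.r.t. W at W = W0: the 'random feature' of the NTK regime."""
    act = (W0 @ x > 0).astype(float)             # ReLU activation pattern at initialization
    return (b * act)[:, None] * x[None, :] / np.sqrt(len(b))

def linearized(x, W, W0, b):
    """First-order expansion: f(x; W) ~ f(x; W0) + <grad_W f(x; W0), W - W0>."""
    return two_layer(x, W0, b) + np.sum(ntk_features(x, W0, b) * (W - W0))

# For large m and W close to W0, the linearized value tracks the true network output closely.
rng = np.random.default_rng(0)
W0, b = init_two_layer(d=5, m=10000, rng=rng)
W = W0 + 1e-3 * rng.normal(size=W0.shape) / np.sqrt(W0.shape[0])
x = rng.normal(size=5)
print(two_layer(x, W, b), linearized(x, W, W0, b))
```

In the analyses cited in the excerpt, both the actor and the critic are treated in this way, so their training approximately reduces to learning linear coefficients over such random features.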
“…Recently, policy optimization (PO) has seen great success in many real-world applications, especially when coupled with function approximation (Silver et al., 2017; Duan et al., 2016; Wang et al., 2018), and a variety of PO-based algorithms have been proposed (Williams, 1992; Kakade, 2001; Schulman et al., 2015, 2017; Konda and Tsitsiklis, 2000). The theoretical understanding of PO has also been studied from both a computational (i.e., convergence) perspective (Liu et al., 2019) and a statistical (i.e., regret) perspective (Efroni et al., 2020). Unfortunately, all of these algorithms are non-private, and thus applying them directly to the personalized services above may lead to privacy concerns.…”
Section: Introduction
confidence: 99%