“…Non-asymptotic analysis of (natural) policy gradient methods. Moving beyond tabular MDPs, finite-time convergence guarantees for PG/NPG methods and their variants have recently been established for control problems (e.g., [18,19,44,58]), regularized MDPs (e.g., [11,24,54]), constrained MDPs (e.g., [15,50]), robust MDPs (e.g., [29,60]), MDPs with function approximation (e.g., [1,2,10,25,30,45]), Markov games (e.g., [13,14,46,49,61]), as well as for their use in actor-critic methods (e.g., [3,12,48,51]).…”