In many applications of Reinforcement Learning (RL), it is critically important that the algorithm performs safely, such that instantaneous hard constraints are satisfied at each step and unsafe states and actions are avoided. However, existing algorithms for "safe" RL are often designed for constraint formulations that either only require the expected cumulative cost to be bounded or assume that all states are safe. Hence, such algorithms could violate instantaneous hard constraints and traverse unsafe states (and actions) in practice. Therefore, in this paper, we develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes (MDPs) with unsafe states and actions under instantaneous hard constraints and the linear mixture model. It not only achieves a regret $\tilde{O}\big(\frac{dH^3\sqrt{dK}}{\Delta_c}\big)$ that tightly matches the state-of-the-art regret in the setting with only unsafe actions and nearly matches that in the unconstrained setting, but is also safe at each step, where $d$ is the feature-mapping dimension, $K$ is the number of episodes, $H$ is the number of steps in each episode, and $\Delta_c$ is a safety-related parameter. We also provide a lower bound $\tilde{\Omega}\big(\max\big\{dH\sqrt{K}, \frac{H}{\Delta_c^2}\big\}\big)$, which indicates that the dependency on $\Delta_c$ is necessary. Further, both our algorithm design and regret analysis involve several novel ideas, which may be of independent interest.

Recently, instantaneous hard constraints have been studied in theoretical machine learning. Specifically, [12] and [18] studied bandits with linear instantaneous constraints that require a linear safety value of the chosen action to be bounded at each step. However, bandits are only a very special case of MDPs. [14] studied safe RL in linear MDPs with linear instantaneous hard constraints. However, they still assume that only the actions could be unsafe, and hence unsafe states (and transitions) are still not considered. Intuitively, when there are only unsafe actions, any action affects safety only at the current step, whereas with unsafe states, the action taken at the current step also determines, through the unknown transition, whether unsafe states can be reached in future steps.
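As an illustration of the constraint formulation discussed above, the following LaTeX sketch writes out a generic linear instantaneous hard constraint; the symbols $c$, $\phi$, $\theta^*$, and $\bar{c}$ (a safety value, a known feature map, an unknown safety parameter, and a known threshold) are illustrative notation rather than the exact notation of the cited works:

% A generic linear instantaneous hard constraint (illustrative notation,
% requires amsmath): the realized state-action pair at every step h of
% every episode k must satisfy the safety condition with certainty,
% not merely in expectation.
\begin{equation*}
  c(s_h^k, a_h^k)
  \;=\;
  \big\langle \phi(s_h^k, a_h^k),\, \theta^* \big\rangle
  \;\le\; \bar{c},
  \qquad \text{for all } h \in [H],\ k \in [K].
\end{equation*}

By contrast, an expected-cumulative-cost formulation of the kind mentioned above only requires $\mathbb{E}\big[\sum_{h=1}^{H} c(s_h, a_h)\big] \le \bar{C}$ for some budget $\bar{C}$, which permits individual steps to be unsafe as long as the average total cost stays bounded.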