2022
DOI: 10.48550/arxiv.2201.11965
Preprint

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

Abstract: We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints, which play a central role in ensuring the safety of RL in time-varying environments. In this problem, the reward/utility functions and the state transition functions are both allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain known variation budgets. Designing safe RL algorithms in time-varying environm…
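
The abstract's notion of a "variation budget" can be made concrete. The following is a minimal sketch of how such budgets are typically defined over T episodes; the symbols B_r, B_p and the particular norms below are illustrative assumptions, not reproduced from the paper.

```latex
% Illustrative only: cumulative drift of rewards and transitions bounded by
% known budgets B_r and B_p; the paper's exact norms/symbols may differ.
\[
  \sum_{t=1}^{T-1} \max_{s,a} \bigl| r_{t+1}(s,a) - r_t(s,a) \bigr| \;\le\; B_r,
  \qquad
  \sum_{t=1}^{T-1} \max_{s,a} \bigl\| P_{t+1}(\cdot \mid s,a) - P_t(\cdot \mid s,a) \bigr\|_1 \;\le\; B_p .
\]
```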

Cited by 4 publications (6 citation statements)
References 20 publications
“…Different from the general concave case (18), the bound (21) does not contain the constant error term O(ε). Thus, by choosing η_2 = T^{-1/2}, the average performance has the order O(T^{-1/2}).…”
Section: Assumption 4.1 (Parameterization), mentioning
confidence: 99%
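
To see why that step-size choice yields the stated rate, here is a one-line sketch under an assumed bound of the generic form O(1/(η_2 T) + η_2); the exact constants of the cited bound (21) are not reproduced here.

```latex
% Assumed generic bound; balancing the two terms in eta_2 gives the rate.
\[
  \frac{1}{T}\sum_{t=1}^{T} \mathrm{gap}_t
  \;\lesssim\; \frac{1}{\eta_2 T} + \eta_2
  \;\overset{\eta_2 = T^{-1/2}}{=}\; 2\,T^{-1/2}
  \;=\; O\!\left(T^{-1/2}\right).
\]
```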
“…CMDP Our work is also pertinent to policy-based CMDP algorithms [10, 19-23]. In particular, [13] develops a natural policy gradient-based primal-dual algorithm and shows that it enjoys an O(T^{-1/2}) global convergence rate regarding both the optimality gap and the constraint violation under the soft-max parameterization.…”
Section: Related Work, mentioning
confidence: 99%
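
As a rough illustration of the primal-dual scheme referenced in these statements, the sketch below alternates a softmax (natural-gradient-style) ascent step on the Lagrangian with a projected dual descent step on the constraint multiplier. All names (npg_pd_step, q_r, q_g, b) and the tabular softmax setup are assumptions for illustration, not the cited algorithm's actual interface.

```python
import numpy as np

def softmax_policy(theta):
    """Tabular softmax policy; theta has shape (num_states, num_actions)."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def npg_pd_step(theta, lam, q_r, q_g, b, eta_theta=0.1, eta_lam=0.1, lam_max=10.0):
    """One hypothetical primal-dual update for a constrained MDP.

    theta : (S, A) softmax policy parameters (primal variable)
    lam   : scalar Lagrange multiplier (dual variable)
    q_r   : (S, A) estimated reward Q-values under the current policy
    q_g   : (S, A) estimated utility (constraint) Q-values
    b     : constraint threshold, i.e. we require expected utility >= b
    """
    pi = softmax_policy(theta)
    # Lagrangian Q-values: reward plus lambda-weighted utility.
    q_lagr = q_r + lam * q_g
    # For tabular softmax, a natural-gradient-style ascent step amounts to
    # shifting the logits by the advantage of the Lagrangian Q-values.
    adv = q_lagr - (pi * q_lagr).sum(axis=1, keepdims=True)
    theta_new = theta + eta_theta * adv
    # Dual descent: increase lambda when the constraint value falls below b.
    v_g = (pi * q_g).sum(axis=1).mean()   # crude scalar constraint estimate
    lam_new = np.clip(lam - eta_lam * (v_g - b), 0.0, lam_max)
    return theta_new, lam_new

# Toy usage with random Q estimates (2 states, 3 actions).
rng = np.random.default_rng(0)
theta, lam = np.zeros((2, 3)), 1.0
theta, lam = npg_pd_step(theta, lam, rng.random((2, 3)), rng.random((2, 3)), b=0.5)
```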
“…CMDP Our work is also pertinent to policy-based CMDP algorithms (Altman 1999; Borkar 2005; Achiam et al. 2017; Ding and Lavaei 2022; Chow et al. 2017; Efroni, Mannor, and Pirotta 2020). In particular, Ding et al. (2020) develops a natural policy gradient-based primal-dual algorithm and shows that it enjoys an O(T^{-1/2}) global convergence rate regarding both the optimality gap and the constraint violation under the standard soft-max parameterization.…”
Section: Related Work, mentioning
confidence: 99%
“…Non-stationary RL has been mostly studied in the risk-neutral setting. When the variation budget is known a priori, a common strategy for adapting to the non-stationarity is to follow the forgetting principle, such as the restart strategy (Mao et al. 2020; Zhou et al. 2020; Zhao et al. 2020; Ding and Lavaei 2022), exponentially decayed weights (Touati and Vincent 2020), or a sliding window (Cheung, Simchi-Levi, and Zhu 2020; Zhong, Yang, and Szepesvári 2021). In this work, we focus on the restart method, mainly due to its simplicity and memory efficiency (Zhao et al. 2020), and generalize it to the risk-sensitive RL setting.…”
Section: Related Work, mentioning
confidence: 99%
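
To make the restart principle concrete, here is a minimal sketch of a restart wrapper: it resets the learner's statistics at fixed epoch boundaries, with the epoch length chosen from the horizon and the known variation budget. The function names, the concrete epoch-length formula, and the learner interface are assumptions for illustration, not the schedules used in the cited works.

```python
import math

def run_with_restarts(make_learner, run_episode, T, variation_budget):
    """Run T episodes, restarting the learner on a fixed schedule.

    make_learner     : factory returning a fresh learner (drops all history)
    run_episode      : callable(learner, t) -> episodic return
    variation_budget : known bound B on the cumulative non-stationarity
    """
    # Illustrative epoch length: grows with T and shrinks with the budget B.
    # (The optimal schedule depends on the specific regret analysis.)
    B = max(variation_budget, 1e-6)
    epoch_len = max(1, math.ceil((T / B) ** (2.0 / 3.0)))

    learner, returns = make_learner(), []
    for t in range(T):
        if t > 0 and t % epoch_len == 0:
            learner = make_learner()   # restart: forget all past estimates
        returns.append(run_episode(learner, t))
    return returns
```

A sliding-window or exponentially weighted variant, as mentioned in the quoted passage, would replace the hard reset with a gradual down-weighting of old data at the cost of extra memory.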