2022
DOI: 10.48550/arxiv.2205.10715
Preprint
Policy-based Primal-Dual Methods for Convex Constrained Markov Decision Processes

Abstract: We study convex Constrained Markov Decision Processes (CMDPs) in which the objective is concave and the constraints are convex in the state-action visitation distribution. We propose a policy-based primal-dual algorithm that updates the primal variable via policy gradient ascent and updates the dual variable via projected sub-gradient descent. Despite the loss of additivity structure and the nonconvex nature, we establish the global convergence of the proposed algorithm by leveraging a hidden convexity in the …
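The primal-dual template described in the abstract can be sketched on a toy problem. The following is a minimal illustration, not the paper's algorithm: it uses a hypothetical one-step CMDP (a constrained bandit with made-up reward vector `r`, cost vector `c`, and `budget`), where the state-action visitation distribution reduces to the policy itself, and it uses the exact softmax policy gradient in place of sampled estimates.

```python
import numpy as np

# Hypothetical one-step CMDP: maximize expected reward subject to
# expected cost <= budget.  Instance chosen only for illustration.
r = np.array([1.0, 0.5])      # per-action reward
c = np.array([1.0, 0.1])      # per-action cost
budget = 0.4                  # constraint: expected cost <= budget

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(2)           # primal variable: policy parameters
lam = 0.0                     # dual variable: Lagrange multiplier
T = 5000
avg_pi = np.zeros(2)

for t in range(T):
    eta = 0.5 / np.sqrt(t + 1)        # diminishing step size
    pi = softmax(theta)
    avg_pi += pi
    adv = r - lam * c                 # Lagrangian "advantage"
    # Primal update: policy gradient ascent on the Lagrangian
    # (exact softmax policy gradient of  pi @ adv  w.r.t. theta)
    theta += eta * pi * (adv - pi @ adv)
    # Dual update: projected sub-gradient descent, projecting onto lam >= 0
    lam = max(0.0, lam + eta * (pi @ c - budget))

avg_pi /= T  # averaged policy; its expected cost hovers near the budget
```

In the general multi-step setting the paper considers, the objective and constraints are concave/convex in the occupancy measure rather than linear, and the exact gradient above would be replaced by policy-gradient estimates from trajectories; averaging the iterates, as done here, is the standard device for extracting a feasible near-optimal policy from oscillating primal-dual dynamics.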

Cited by 1 publication (1 citation statement). References 16 publications (52 reference statements).
“…We notice that recent last-iterate convergence result for convex-concave saddle-point problems [24] is also applicable to Problem (17), which provides the optimal rate without problem-dependent constants. It is worth mentioning that direct application of such last-iterate convergence results in convex minimax optimization to constrained MDPs with general utilities [9,111] and convex MDPs [120,116] in occupancy-measure space is also straightforward. We omit these exercises in this paper, and focus on the design and analysis of algorithms in policy space.…”
Section: B.4 Constrained MDPs in Occupancy-Measure Space
confidence: 99%