2022 IEEE 61st Conference on Decision and Control (CDC)
DOI: 10.1109/cdc51059.2022.9992419

Policy gradient primal-dual mirror descent for constrained MDPs with large state spaces

Abstract: We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP). Despite the popularity of Lagrangian-based policy search methods used in practice, the oscillation of policy iterates in these methods has not been fully understood, bringing out issues such as violation of constraints and sensitivity to hyper-parameters. To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in…
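For context, the Lagrangian relaxation referred to in the abstract has a standard form; the sketch below uses generic notation (V_r, V_g, b, ρ, λ are illustrative symbols and are not taken from the paper itself):

```latex
% Constrained MDP: maximize the discounted reward value subject to a
% discounted utility (constraint) value exceeding a threshold b.
% All symbols are generic and assumed, not quoted from the paper.
\begin{align*}
  \max_{\pi} \; & V_r^{\pi}(\rho)
    \quad \text{s.t.} \quad V_g^{\pi}(\rho) \ge b \\[4pt]
  \text{Lagrangian:} \quad
    & L(\pi, \lambda) \;=\; V_r^{\pi}(\rho) \;+\; \lambda \bigl( V_g^{\pi}(\rho) - b \bigr) \\[4pt]
  \text{Saddle point:} \quad
    & \max_{\pi} \; \min_{\lambda \ge 0} \; L(\pi, \lambda)
\end{align*}
```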

Cited by 2 publications (3 citation statements)
References 77 publications (246 reference statements)
“…(2020) developed a martingale approach to learn policies that are sensitive to the uncertainty of the rewards and are meaningful under some market scenarios. Another line of work focuses on constrained RL problems with different risk criteria (Achiam et al., 2017; Chow et al., 2017, 2015; Ding et al., 2021; Tamar et al., 2015; Zheng & Ratliff, 2020). Very recently, Jaimungal et al.…”
Section: Further Developments For Mathematical Finance and Reinforcem…
confidence: 99%
“…Then, a suboptimal policy is obtained by iteratively solving the subproblem (8) with linear approximations on the objective and the safety constraint and quadratic approximations on the KL divergence term.…”
Section: Constrained Policy Optimization
confidence: 99%
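The subproblem described in this quoted passage follows the familiar constrained policy optimization template; a minimal sketch in generic notation (θ_k, g, a, c, d, H, δ are illustrative and do not reproduce the citing paper's equation (8)):

```latex
% Trust-region update with linearized objective and safety constraint and a
% quadratic (KL-based) trust region around the current policy parameters.
% g: objective gradient, a: constraint gradient, c: current constraint value,
% d: constraint budget, H: KL Hessian, delta: trust-region radius (all assumed).
\begin{align*}
  \theta_{k+1} \;=\; \arg\max_{\theta} \;\; & g^{\top} (\theta - \theta_k) \\
  \text{s.t.} \quad & c \;+\; a^{\top} (\theta - \theta_k) \;\le\; d, \\
  & \tfrac{1}{2} (\theta - \theta_k)^{\top} H \, (\theta - \theta_k) \;\le\; \delta .
\end{align*}
```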
“…Thus, we approximate the upper bound on CVaR by removing the divergence term in (14) and add a trust-region constraint as in [12] and [15]. Then, the proposed CVaR-constrained subproblem can be written as below by replacing the constraint in (8) with the approximated CVaR.…”
Section: B. Policy Optimization in Trust Region
confidence: 99%
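The CVaR bound in this passage is commonly handled through the Rockafellar–Uryasev representation; a hedged sketch of that form and of a trust-region step with an approximated CVaR constraint follows (the citing paper's equations (8) and (14) are not reproduced, and all symbols below are illustrative):

```latex
% Rockafellar-Uryasev form of CVaR at level alpha for a cost X, and a generic
% trust-region subproblem with the CVaR estimate as the safety constraint.
% d_0 (CVaR budget) and delta (trust-region radius) are assumed symbols.
\begin{align*}
  \mathrm{CVaR}_{\alpha}(X) \;&=\; \min_{\nu \in \mathbb{R}}
    \Bigl\{ \nu + \tfrac{1}{1-\alpha}\, \mathbb{E}\bigl[(X - \nu)_{+}\bigr] \Bigr\} \\
  \max_{\theta} \;\; g^{\top} (\theta - \theta_k)
    \quad &\text{s.t.} \quad
    \widehat{\mathrm{CVaR}}_{\alpha}(\theta) \;\le\; d_0, \qquad
    \bar{D}_{\mathrm{KL}}\bigl(\pi_{\theta_k} \,\|\, \pi_{\theta}\bigr) \;\le\; \delta .
\end{align*}
```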