2023
DOI: 10.48550/arxiv.2301.13589
Preprint

Policy Gradient for s-Rectangular Robust Markov Decision Processes

Abstract: We present a novel robust policy gradient method (RPG) for s-rectangular robust Markov Decision Processes (MDPs). We are the first to derive the adversarial kernel in closed form and demonstrate that it is a rank-one perturbation of the nominal kernel. This allows us to derive an RPG that is similar to the one used in non-robust MDPs, except with a robust Q-value function and an additional correction term. Both robust Q-values and correction terms are efficiently computable, thus the time complexity of our m…
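The truncated abstract does not reproduce the paper's formulas, so the following is only a schematic sketch in assumed notation (the nominal kernel $\bar{P}$, the factors $b$ and $k$, and the unspecified correction term are placeholders, not the paper's definitions). A rank-one perturbation of the nominal kernel has the form

\[ P_{\mathrm{adv}}(s' \mid s, a) \;=\; \bar{P}(s' \mid s, a) \;+\; b(s, a)\, k(s'), \]

and a robust policy gradient that keeps the non-robust shape, with a robust Q-value plus a correction, would read

\[ \nabla_{\theta} J(\pi_{\theta}) \;\approx\; \mathbb{E}_{s,a}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}_{\mathrm{robust}}(s, a) \right] \;+\; \text{correction term}. \]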

Cited by 1 publication (3 citation statements). References 5 publications.

Citation statements:
“…In fact, except for those considered in (Mannor, Mebel, and Xu 2016; Goyal and Grand-Clement 2023) which are locally coupled, s-rectangular uncertainty sets represent the largest class of tractable RMDPs. On the other hand, if not the studies (Xu and Mannor 2010; Mannor, Mebel, and Xu 2016; Derman, Geist, and Mannor 2021; Kumar et al. 2023) that treat both reward and transition uncertainty, RMDP literature has mostly focused just on transition uncertainty. We believe this is due to the greater challenge it represents, as the repercussions of transition ambiguity are epistemic and can lead to a butterfly effect: a small kernel deviation at some state can have an unpredictable effect on another state so we are no longer able to track how local kernel uncertainty propagates across the state space.…”
Section: Related Work
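For context on the terminology in this excerpt (standard RMDP definitions, not quoted from the citing paper): an uncertainty set $\mathcal{U}$ is $(s,a)$-rectangular when it factorizes independently across state-action pairs, and $s$-rectangular when it only factorizes across states, so uncertainties over different actions at the same state may be coupled:

\[ \mathcal{U} = \prod_{s, a} \mathcal{U}_{s,a} \quad \text{($(s,a)$-rectangular)}, \qquad \mathcal{U} = \prod_{s} \mathcal{U}_{s} \quad \text{($s$-rectangular)}. \]

Every $(s,a)$-rectangular set is also $s$-rectangular, which is why the excerpt describes $s$-rectangular sets as the largest tractable class apart from the locally coupled sets of Mannor, Mebel, and Xu (2016) and Goyal and Grand-Clement (2023).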
“…In that respect, the robust policy gradient methods recently introduced in (Wang and Zou 2022; Kumar et al. 2023; Li, Zhao, and Lan 2022) assume the uncertainty set to be rectangular. Although Wang and Zou (2022) did prove convergence in the non-rectangular case, their analysis exclusively focused on transition uncertainty while they assumed oracle access to the policy gradient.…”
Section: Related Work