Risk Averse Robust Adversarial Reinforcement Learning

Pan, Xinlei; Seita, Daniel; Gao, Yang; Canny, John

doi:10.1109/icra.2019.8794293

Cited by 61 publications

(36 citation statements)

References 18 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The computation of the subgradient in (30) requires the exact value of J c (X k ), which cannot be obtained in the sample-based setting. Thus, we estimate it by…”

Section: B a Sample-based Primal-dual Algorithmmentioning

confidence: 99%

Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs

Zhao¹,

You²,

Başar³

2021

Preprint

View full text Add to dashboard Cite

While the techniques in optimal control theory are often model-based, the policy optimization (PO) approach can directly optimize the performance metric of interest without explicit dynamical models, and is an essential approach for reinforcement learning problems. However, it usually leads to a non-convex optimization problem in most cases, where there is little theoretical understanding on its performance. In this paper, we focus on the risk-constrained Linear Quadratic Regulator (LQR) problem with noisy input via the PO approach, which results in a challenging non-convex problem. To this end, we first build on our earlier result that the optimal policy has an affine structure to show that the associated Lagrangian function is locally gradient dominated with respect to the policy, based on which we establish strong duality. Then, we design policy gradient primal-dual methods with global convergence guarantees to find an optimal policy-multiplier pair in both model-based and sample-based settings. Finally, we use samples of system trajectories in simulations to validate our policy gradient primaldual methods.

show abstract

“…The computation of the subgradient in (30) requires the exact value of J c (X k ), which cannot be obtained in the sample-based setting. Thus, we estimate it by…”

Section: B a Sample-based Primal-dual Algorithmmentioning

confidence: 99%

Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs

Zhao¹,

You²,

Başar³

2021

Preprint

View full text Add to dashboard Cite

show abstract

“…Theorem 2: (weighted sup-norm bound) Let V * be the approximate value function solution to (36), and V * be the solution to (9). Then,…”

Section: B One-shot Semi-infinite-dimensional Convex Programmentioning

confidence: 99%

“…On the other hand, entropic risk measures leverage exponential cost functions to simultaneously optimize the average cost and its variance [33]- [34]. To account for epistemic uncertainties, policy gradient (PG) method [35], as an RL algorithm, has been leveraged to learn the solution to value-at-risk setting [36]- [39] and exponential utility setting [34]. PG algorithms explicitly parameterize the policy and update the control parameters in the direction of the gradient of the performance.…”

Section: Introductionmentioning

confidence: 99%

A Convex Programming Approach to Data-Driven Risk-Averse Reinforcement Learning

Han¹,

Mazouchi²,

Nageshrao³

et al. 2021

Preprint

View full text Add to dashboard Cite

This paper presents a model-free reinforcement learning (RL) algorithm to solve the risk-averse optimal control (RAOC) problem for discrete-time nonlinear systems. While successful RL algorithms have been presented to learn optimal control solutions under epistemic uncertainties (i.e., lack of knowledge of system dynamics), they do so by optimizing the expected utility of outcomes, which ignores the variance of cost under aleatory uncertainties (i.e., randomness). Performance-critical systems, however, must not only optimize the expected performance, but also reduce its variance to avoid performance fluctuation during RL's course of operation. To solve the RAOC problem, this paper presents the following three variants of RL algorithms and analyze their advantages and preferences for different situations/systems: 1) a one-shot static convex program -based RL, 2) an iterative value iteration (VI) algorithm that solves a linear programming (LP) optimization at each iteration, and 3) an iterative policy iteration (PI) algorithm that solves a convex optimization at each iteration and guarantees the stability of the consecutive control policies. Convergence of the exact optimization problems, which are infinite dimensional in all three cases, to the optimal risk-averse value function is shown. To turn these optimization problems into standard optimization problems with finite decision variables and constraints, function approximation for value estimations as well as constraint sampling are leveraged. Data-driven implementations of these algorithms are provided based on Q-function which enables learning the optimal value without any knowledge of the system dynamics. The performance of the approximated solutions is also verified through a weighted sup-norm bound and the Lyapunov bound. A simulation example is provided to verify the effectiveness of the presented approach.

show abstract

“…Robust planning in RL Robustness in RL has been heavily studied, both in the context of robust adversarial RL [Pinto et al, 2017, Pan et al, 2019, Zhang et al, 2020a and nonstationarity in multi-agent RL settings [Li et al, 2019, Zhang et al, 2020b. For example, PSRO extends double oracle from state-independent pure strategies to policy-space strategies to be used for multiplayer competitive games [Lanctot et al, 2017].…”

Section: Related Workmentioning

confidence: 99%

Robust Reinforcement Learning Under Minimax Regret for Green Security

Xu,

Perrault,

Fang

et al. 2021

Preprint

View full text Add to dashboard Cite

Green security domains feature defenders who plan patrols in the face of uncertainty about the adversarial behavior of poachers, illegal loggers, and illegal fishers. Importantly, the deterrence effect of patrols on adversaries' future behavior makes patrol planning a sequential decision-making problem. Therefore, we focus on robust sequential patrol planning for green security following the minimax regret criterion, which has not been considered in the literature. We formulate the problem as a game between the defender and nature who controls the parameter values of the adversarial behavior and design an algorithm MIRROR to find a robust policy. MIR-ROR uses two reinforcement learning-based oracles and solves a restricted game considering limited defender strategies and parameter values. We evaluate MIRROR on real-world poaching data.

show abstract

Risk Averse Robust Adversarial Reinforcement Learning

Cited by 61 publications

References 18 publications

Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs

Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs

A Convex Programming Approach to Data-Driven Risk-Averse Reinforcement Learning

Robust Reinforcement Learning Under Minimax Regret for Green Security

Contact Info

Product

Resources

About