Motivated by the recent empirical success of policy-based reinforcement learning (RL), there has been a growing trend of research studying the performance of policy-based RL methods on standard control benchmark problems. In this paper, we examine the effectiveness of policy-based RL methods on an important robust control problem, namely µ synthesis. We build a connection between robust adversarial RL and µ synthesis, and develop a model-free version of the well-known DK-iteration for solving state-feedback µ synthesis with static D-scaling. In the proposed algorithm, the K step mimics the classical central path algorithm by incorporating a recently developed double-loop adversarial RL method as a subroutine, and the D step is based on model-free finite-difference approximation. An extensive numerical study is also presented to demonstrate the utility of the proposed model-free algorithm. Our study sheds new light on the connections between adversarial RL and robust control.
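For concreteness, the following Python sketch illustrates the kind of alternation the abstract describes. The names `rollout_cost` and `k_step_adversarial_rl` are hypothetical placeholders for a black-box simulator and the double-loop adversarial RL subroutine, and the simple finite-difference update on the scaling is an assumed simplification, not the paper's exact D step.

```python
import numpy as np

def model_free_dk_iteration(rollout_cost, k_step_adversarial_rl, d_dim,
                            n_iters=10, fd_eps=1e-3, d_lr=1e-2):
    """Hypothetical skeleton of a model-free DK-iteration.

    rollout_cost(K, d): black-box simulator returning a robust-performance
        cost for controller K under static D-scaling parameters d.
    k_step_adversarial_rl(d): K step; returns a controller for the D-scaled
        plant, e.g. via a double-loop adversarial RL subroutine (not shown).
    """
    d = np.ones(d_dim)  # start from an identity-like scaling
    K = None
    for _ in range(n_iters):
        # K step: fix D and solve the resulting robust control problem as a
        # zero-sum game between the policy and a worst-case disturbance.
        K = k_step_adversarial_rl(d)
        # D step: fix K and improve the scaling with a central
        # finite-difference gradient estimate of the rollout cost.
        grad = np.zeros_like(d)
        for i in range(d_dim):
            e = np.zeros_like(d)
            e[i] = fd_eps
            grad[i] = (rollout_cost(K, d + e) - rollout_cost(K, d - e)) / (2 * fd_eps)
        d = np.maximum(d - d_lr * grad, 1e-6)  # keep the scalings positive
    return K, d
```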
The growing prospect of deep reinforcement learning (DRL) being used in cyber-physical systems has raised concerns about the safety and robustness of autonomous agents. Recent work on generating adversarial attacks has shown that it is computationally feasible for a bad actor to fool a DRL policy into behaving suboptimally. Although certain adversarial attacks with specific attack models have been addressed, most studies focus only on off-line optimization in the data space (e.g., example fitting, distillation). This paper introduces a Meta-Learned Advantage Hierarchy (MLAH) framework that is attack-model-agnostic and better suited to reinforcement learning, as it handles attacks in the decision space (as opposed to the data space) and directly mitigates the learned bias introduced by the adversary. In MLAH, we learn separate sub-policies (nominal and adversarial) in an online manner, guided by a supervisory master agent that detects the presence of the adversary by leveraging the advantage function of the sub-policies. We demonstrate that the proposed algorithm enables policy learning with significantly lower bias than state-of-the-art policy learning approaches, even in the presence of heavy state-information attacks. We present algorithm analysis and simulation results using popular OpenAI Gym environments.
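As a rough illustration of the advantage-based switching idea, here is a simplified Python sketch. The class name, the one-step advantage estimate, and the fixed detection threshold are all assumptions made for illustration; the paper's actual master agent and detection mechanism may differ.

```python
import numpy as np

class MLAHSketch:
    """Simplified, hypothetical sketch of MLAH-style switching: a master
    agent picks between nominal and adversarial sub-policies based on a
    one-step advantage estimate under the active sub-policy."""

    def __init__(self, nominal_policy, adversarial_policy, threshold=-0.5):
        self.sub_policies = {"nominal": nominal_policy,
                             "adversarial": adversarial_policy}
        self.active = "nominal"
        self.threshold = threshold  # assumed detection threshold

    def update_master(self, value_fn, state, reward, next_state, gamma=0.99):
        # One-step advantage estimate: A ~ r + gamma * V(s') - V(s).
        advantage = reward + gamma * value_fn(next_state) - value_fn(state)
        # A strongly negative advantage under the nominal sub-policy is
        # read as evidence that state observations are being attacked.
        if self.active == "nominal" and advantage < self.threshold:
            self.active = "adversarial"
        elif self.active == "adversarial" and advantage >= self.threshold:
            self.active = "nominal"

    def act(self, state):
        return self.sub_policies[self.active](state)
```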
Many existing region-of-attraction (ROA) analysis tools struggle to address feedback systems with large-scale neural network (NN) policies and/or high-dimensional sensing modalities such as cameras. In this letter, we tailor the projected gradient descent (PGD) attack method as a general-purpose ROA analysis tool for high-dimensional nonlinear systems and end-to-end perception-based control. We show that the ROA analysis can be approximated as a constrained maximization problem to which PGD-based iterative methods can be directly applied. In the model-based setting, we show that the PGD updates can be performed efficiently using back-propagation. In the model-free setting, we propose a finite-difference PGD estimate that is general and requires only a black-box simulator generating trajectories of the closed-loop system from any initial state. Finally, we demonstrate the scalability and generality of our analysis tool on several numerical examples with large state dimensions or complex image observations.
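The model-free variant can be sketched in a few lines of Python. Here `simulate_cost` is a hypothetical black-box simulator, and the norm-ball constraint set, step sizes, and divergence measure are assumptions chosen to make the sketch self-contained; they stand in for whatever cost and constraint set the analysis actually uses.

```python
import numpy as np

def finite_difference_pgd(simulate_cost, x0, radius, n_steps=100,
                          step_size=0.05, fd_eps=1e-3):
    """Hypothetical sketch of a model-free, finite-difference PGD estimate.

    simulate_cost(x): black-box simulator that rolls out the closed-loop
        system from initial state x and returns a divergence measure
        (e.g., final distance from the equilibrium).
    Searches for a worst-case initial state in the ball ||x|| <= radius.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # Central finite-difference estimate of the gradient of the rollout cost.
        grad = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = fd_eps
            grad[i] = (simulate_cost(x + e) - simulate_cost(x - e)) / (2 * fd_eps)
        # Gradient ascent toward initial states with diverging trajectories,
        x = x + step_size * grad
        # followed by projection back onto the constraint set (a norm ball here).
        norm = np.linalg.norm(x)
        if norm > radius:
            x *= radius / norm
    return x, simulate_cost(x)
```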
When applying imitation learning techniques to fit a policy from expert demonstrations, one can take advantage of prior stability/robustness assumptions on the expert's policy and incorporate such control-theoretic prior knowledge explicitly into the learning process. In this paper, we formulate the imitation learning of linear policies as a constrained optimization problem and present efficient methods for enforcing stability and robustness constraints during the learning process. Specifically, we show that closed-loop stability and robustness can be guaranteed by imposing linear matrix inequality (LMI) constraints on the fitted policy. Both projected gradient descent and the alternating direction method of multipliers (ADMM) can then be applied to solve the resulting constrained policy-fitting problem. Finally, we provide numerical results demonstrating the effectiveness of our methods in producing linear policies with various stability and robustness guarantees.
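A minimal projected-gradient sketch of this idea is given below, using cvxpy for the projection step. Fixing a Lyapunov matrix P in advance is an assumed simplification that makes the continuous-time stability LMI linear in the gain K, and hence each projection a small convex program; the paper's actual parameterization and constraints may be more general.

```python
import numpy as np
import cvxpy as cp

def fit_stable_linear_policy(X, U, A, B, P, n_iters=50, lr=1e-2, margin=1e-3):
    """Hypothetical projected-gradient sketch for LMI-constrained policy fitting.

    X (N x n) and U (N x m) hold expert states/actions; the policy is u = K x.
    For a fixed Lyapunov matrix P > 0 (an assumption that keeps the
    projection convex), the continuous-time stability LMI
        (A + B K)^T P + P (A + B K) < 0
    is linear in K.
    """
    n, m = A.shape[0], B.shape[1]
    K = np.zeros((m, n))
    for _ in range(n_iters):
        # Gradient step on the imitation loss ||X K^T - U||_F^2 / N.
        grad_KT = 2.0 * X.T @ (X @ K.T - U) / len(X)
        K = K - lr * grad_KT.T
        # Projection onto the LMI-feasible set via a convex program.
        Kv = cp.Variable((m, n))
        Acl = A + B @ Kv
        lmi = Acl.T @ P + P @ Acl
        lmi = (lmi + lmi.T) / 2  # symmetrize explicitly for the solver
        prob = cp.Problem(cp.Minimize(cp.sum_squares(Kv - K)),
                          [lmi << -margin * np.eye(n)])
        prob.solve()
        K = Kv.value
    return K
```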
Reinforcement learning (RL) is a machine learning paradigm in which an agent attempts to learn a control policy that generates the sequence of actions needed to achieve a higher-level objective. RL promises a learning mechanism through which autonomous agents can learn to control themselves directly from experience, without requiring manual coding of control policies. Like other machine learning paradigms, RL research focuses heavily on end-to-end learning, which in this case means learning policies directly from experience. Recent successes of RL have shown that agents can learn decision-making and control policies in complex simulated environments for which manually designing control policies would be very difficult. Examples include chess, Go, and, more recently, complex continuous-time simulated domains. Pressing issues include sample complexity, robustness, and reliable simulation-to-real-world transfer.