2019
DOI: 10.48550/arxiv.1910.00125
Preprint

Meta-Q-Learning

Abstract: This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, using a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay…
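As a reading aid for the first two ideas in the abstract, here is a minimal sketch (not the authors' code) of a Q-function conditioned on a context vector summarizing the past trajectory, trained with a TD loss averaged over training tasks. The class name ContextQNetwork, the helper multi_task_td_loss, the policy interface, and all network sizes are illustrative assumptions.

```python
# Minimal sketch, not the MQL reference implementation: a Q-function that
# takes a context vector summarizing the past trajectory, trained with a
# one-step TD loss averaged across training tasks (the multi-task objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextQNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, ctx_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, ctx):
        # ctx: fixed-size representation of the past trajectory
        return self.net(torch.cat([obs, act, ctx], dim=-1)).squeeze(-1)

def multi_task_td_loss(q_net, target_q_net, policy, task_batches, gamma=0.99):
    """Average the one-step TD error over a list of per-task batches.

    `policy(next_obs, ctx)` is an assumed interface returning next actions.
    """
    losses = []
    for b in task_batches:  # each b: dict of tensors for one training task
        with torch.no_grad():
            next_act = policy(b["next_obs"], b["ctx"])
            target = b["reward"] + gamma * (1.0 - b["done"]) * target_q_net(
                b["next_obs"], next_act, b["ctx"])
        q = q_net(b["obs"], b["act"], b["ctx"])
        losses.append(F.mse_loss(q, target))
    # Averaging across tasks mirrors the multi-task (average-across-tasks) objective.
    return torch.stack(losses).mean()
```

Averaging per-task losses is the simplest way to reflect the "maximize the average reward across the training tasks" objective; the paper's actual training procedure differs in details.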

Cited by 20 publications (32 citation statements) · References 17 publications

Citation statements (ordered by relevance):
“…That is, RL learning does not require the latest updated policy to interact with the environment. Rather, it can leverage experience from other policies, for example by learning from replay-buffer samples collected under an old policy, as in [18]. Examples of this category include [19], [20], where off-policy meta-RL algorithms were developed by decoupling task inference from policy training.…”
Section: Related Work
confidence: 99%
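The statement above relies on the standard off-policy property that experience collected by older policies can be reused. A minimal, generic replay-buffer sketch (not taken from [18]-[20]) makes this concrete:

```python
# Generic replay buffer (illustrative): transitions gathered under any past
# policy are stored and later resampled for off-policy updates.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, obs, act, reward, next_obs, done):
        self.storage.append((obs, act, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling; a batch may mix data from arbitrarily old policies.
        batch = random.sample(list(self.storage), batch_size)
        obs, act, rew, next_obs, done = zip(*batch)
        return obs, act, rew, next_obs, done
```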
“…In the meta-testing phase, $D_{\mathrm{test}} = \{D^{(k)}\}_{k=K+1}^{N}$ are sampled from the same task distribution. Although meta-optimization approaches [14,23] have been successfully applied to various image classification tasks, their performance is relatively limited in RL tasks [13]. Recent advances in context-based meta-RL [27] learn a latent representation of the task and construct a context model through recurrent networks [18,8].…”
Section: Preliminaries
confidence: 99%
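The recurrent context model mentioned above (and reused in the next statement) can be pictured as a GRU that consumes past transitions and emits a latent task representation. The sketch below is an illustration under assumed names and dimensions, not the model from [18,8] or [27].

```python
# Illustrative recurrent context encoder: a GRU reads past
# (observation, action, reward) tuples; its final hidden state serves as a
# latent task/context representation.
import torch
import torch.nn as nn

class RecurrentContextEncoder(nn.Module):
    def __init__(self, obs_dim, act_dim, ctx_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, ctx_dim, batch_first=True)

    def forward(self, obs_seq, act_seq, rew_seq):
        # obs_seq: (batch, T, obs_dim), act_seq: (batch, T, act_dim),
        # rew_seq: (batch, T, 1)
        x = torch.cat([obs_seq, act_seq, rew_seq], dim=-1)
        _, h = self.gru(x)      # h: (1, batch, ctx_dim), final hidden state
        return h.squeeze(0)     # context variable fed to the policy/Q-function
```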
“…Thereby, learning efficiency is limited [2]. To address the limited state-visitation problem in the ET-MDP, we adopt the idea of context models, previously introduced in the meta-RL literature [13,27], to improve the generality of policies across different training tasks. In our ET-MDP setting, a context variable is learned to improve the generality of the learned policy over different states, thus enabling the policy to perform safely across different states within one task.…”
Section: Introduction
confidence: 99%
“…There are many concrete formulations of meta-RL (see, e.g., Wang et al., 2015; Duan et al., 2016; Houthooft et al., 2018; Rakelly et al., 2019; Zintgraf et al., 2019; Fakoor et al., 2019; Ortega et al., 2019). Our focus is meta-RL through gradient-based adaptation (Finn et al., 2017), where the agent carries out policy-gradient (PG) inner-loop updates (Sutton et al., 2000) at both meta-training and meta-testing time.…”
Section: Introduction
confidence: 99%
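For the gradient-based formulation referenced above, a first-order sketch of one policy-gradient inner-loop adaptation step looks as follows. The policy.log_prob interface and the use of pre-computed returns are assumptions for illustration; MAML-style methods additionally differentiate through this step in the outer loop.

```python
# First-order sketch of a policy-gradient inner-loop adaptation step
# (illustrative; actual MAML-style meta-RL backpropagates through this update).
import copy
import torch

def inner_loop_adapt(policy, task_batch, inner_lr=0.1):
    """Return a task-adapted copy of `policy` after one REINFORCE-style step.

    Assumes `policy.log_prob(obs, act)` returns per-sample log-probabilities
    and `task_batch["returns"]` holds pre-computed Monte Carlo returns.
    """
    adapted = copy.deepcopy(policy)
    params = list(adapted.parameters())
    log_probs = adapted.log_prob(task_batch["obs"], task_batch["act"])
    loss = -(log_probs * task_batch["returns"]).mean()  # PG surrogate loss
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= inner_lr * g  # one SGD step on this task's data
    return adapted
```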