In goal-based RL, where future states can inform "optimal" reward parameters with respect to the transitions' actions, hindsight methods have been applied successfully to enable effective training of goal-based Q-functions under sparse rewards, to derive exact connections between Q-learning and classic model-based RL, and to enable data-efficient off-policy hierarchical RL (Nachum et al., 2018), multi-task RL (Li et al., 2020), offline RL (Chebotar et al., 2021), and more (Choi et al., 2021; Ren et al., 2019; Zhao & Tresp, 2018; Ghosh et al., 2021; Nasiriany et al., 2021). Additionally, Lynch et al. (2019) and Gupta et al. (2018) have shown that BC is often sufficient for learning generalizable parameterized policies, thanks to the rich positive examples provided by future states; most recently, Chen et al. (2021a) and Janner et al. (2021) combined this idea with powerful transformer architectures (Vaswani et al., 2017) to produce state-of-the-art offline RL and goal-based RL results. Lastly, while motivated by alternative mathematical principles rather than parameterized objectives, future-state information has also been explored as a way of reducing variance or improving estimation in generic policy gradient methods (Pinto et al., 2017; Guo et al., 2021; Venuto et al., 2021).
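To make the core hindsight idea concrete, the sketch below relabels each transition with goals drawn from states reached later in the same trajectory and recomputes the sparse reward accordingly (the "future" relabeling strategy popularized by hindsight experience replay). This is a minimal illustration, not the method of any specific work cited above: the transition layout, `reward_fn`, and the distance tolerance are all illustrative assumptions.

```python
import random
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn, k=4):
    """HER-style "future" relabeling (illustrative sketch).

    For each transition, sample up to k goals from states visited later
    in the same trajectory and recompute the reward as if those goals
    had been intended all along, turning failures into positive examples.

    trajectory: list of (state, action, next_state, goal) tuples
    reward_fn:  maps (next_state, goal) -> scalar reward
    """
    relabeled = []
    for t, (s, a, s_next, g) in enumerate(trajectory):
        # Keep the original transition with its original goal.
        relabeled.append((s, a, s_next, g, reward_fn(s_next, g)))
        # Sample "achieved" goals from the future of this trajectory.
        future_states = [step[2] for step in trajectory[t:]]
        for g_new in random.sample(future_states, min(k, len(future_states))):
            relabeled.append((s, a, s_next, g_new, reward_fn(s_next, g_new)))
    return relabeled

def sparse_reward(next_state, goal, tol=1e-3):
    # Assumed sparse reward: 1 if the next state reaches the goal, else 0.
    return float(np.linalg.norm(np.asarray(next_state) - np.asarray(goal)) <= tol)
```

Under sparse rewards, almost every original transition carries zero reward; relabeling with achieved future states injects the dense positive examples that make goal-conditioned Q-learning and BC-style training effective.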