2021
DOI: 10.48550/arxiv.2104.07749
Preprint

Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills

Cited by 20 publications (44 citation statements) | References 0 publications

“…In the preceding deterministic MDP formulation, we aim at solving a goal-reaching RL problem (Kaelbling, 1993b; Sutton et al., 2011; Andrychowicz et al., 2017; Andreas et al., 2017; Pong et al., 2018; Ghosh et al., 2019; Eysenbach et al., 2020a, 2020b; Kadian et al., 2020; Fujita et al., 2020; Chebotar et al., 2021; Khazatsky et al., 2021) or a planning problem (Bertsekas & Tsitsiklis, 1996; Boutilier et al., 1999; Sutton et al., 1999; Boutilier et al., 2000; Rintanen & Hoffmann, 2001; LaValle, 2006; Russell & Norvig, 2009; Nasiriany et al., 2019). We say a Q-function is successful if its associated greedy policy (Sutton & Barto, 2018)…”
Section: Successful Q-functions (mentioning)
confidence: 99%
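To make the goal-conditioned setup in the excerpt above concrete, here is a minimal sketch (not taken from the cited paper) of a tabular goal-conditioned Q-function and the greedy policy it induces in a deterministic goal-reaching MDP; all names are illustrative placeholders, and the excerpt is cut off before the paper's formal definition of a "successful" Q-function, so the sketch only shows the greedy-policy construction it refers to.

```python
# Illustrative sketch: a tabular goal-conditioned Q-function Q(s, a, g) and
# the greedy policy it induces. Names and structure are assumptions, not code
# from the cited paper.
from collections import defaultdict
import random

ACTIONS = ["up", "down", "left", "right"]

# Q(s, a, g): estimated value of action a in state s when the target is goal g.
q_table = defaultdict(float)  # keyed by (state, action, goal)

def greedy_action(state, goal):
    """Greedy policy associated with the Q-function: pick the action with the
    highest goal-conditioned value, breaking ties at random."""
    values = [q_table[(state, a, goal)] for a in ACTIONS]
    best = max(values)
    return random.choice([a for a, v in zip(ACTIONS, values) if v == best])
```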
“…"relabel"-able, like goal reaching rewards, and (2) if policy or Q-functions are smooth with respect to the reward parameter, generalization can speed up learning even with respect to "unexplored" rewards. In goal-based RL where future states can inform "optimal" reward parameters with respect to the transitions' actions, hindsight methods were applied successfully to enable effective training of goal-based Q-function for sparse rewards , derive exact connections between Q-learning and classic model-based RL , dataefficient off-policy hierarchical RL (Nachum et al, 2018), multi-task RL Li et al, 2020), offline RL (Chebotar et al, 2021), and more Choi et al, 2021;Ren et al, 2019;Zhao & Tresp, 2018;Ghosh et al, 2021;Nasiriany et al, 2021). Additionally, Lynch et al (2019) and Gupta et al (2018) have shown that often BC is sufficient for learning generalizable parameterized policies, due to rich positive examples from future states, and most recently Chen et al (2021a) and Janner et al (2021), when combined with powerful transformer architectures (Vaswani et al, 2017), it produced state-of-the-art offline RL and goal-based RL results.…”
Section: Related Workmentioning
confidence: 99%
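The hindsight relabeling idea referenced in the excerpt above can be sketched in a few lines. This is an illustrative HER-style relabeling routine under assumed data structures (the `Transition` record and function names are hypothetical), not code from any of the cited works: each transition is duplicated with goals taken from states actually reached later in the same trajectory, so a sparse goal-reaching reward becomes frequently nonzero.

```python
# Illustrative hindsight relabeling for sparse-reward, goal-based Q-learning.
import random
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple
    goal: tuple
    reward: float

def sparse_reward(next_state, goal):
    # Sparse goal-reaching reward: 1 only when the commanded goal is achieved.
    return 1.0 if next_state == goal else 0.0

def relabel_with_future_goals(trajectory, k=4):
    """For each transition, also store copies whose goal is a state reached
    later in the same trajectory, so Q-learning gets a useful learning signal."""
    out = []
    for t, tr in enumerate(trajectory):
        out.append(tr)  # keep the original (commanded-goal) transition
        future = trajectory[t:]
        for _ in range(min(k, len(future))):
            new_goal = random.choice(future).next_state
            out.append(Transition(tr.state, tr.action, tr.next_state,
                                  new_goal, sparse_reward(tr.next_state, new_goal)))
    return out
```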
“…Orthogonal to these, in recent years we have seen a number of algorithms that are derived from different motivations and frameworks but share the following common trait: they use future trajectory information τ_{t:T} to accelerate optimization of a contextual policy π(a_t | s_t, z) with context z with respect to a parameterized reward function r(s_t, a_t, z) (see Section 3 for notation). These hindsight algorithms have enabled Q-learning with sparse rewards, temporally extended model-based RL with Q-functions, mastery of 6-DoF object manipulation in cluttered scenes from human play (Lynch et al., 2019), efficient multi-task RL (Li et al., 2020), offline self-supervised discovery of manipulation primitives from pixels (Chebotar et al., 2021), and offline RL using return-conditioned supervised learning with transformers (Chen et al., 2021a; Janner et al., 2021). We derive a generic problem formulation covering all these variants, and observe that this hindsight information matching (HIM) framework, with behavioral cloning (BC) as the learning objective, can learn a conditional policy to generate trajectories that satisfy given properties, including distributional ones.…”
Section: Introduction (mentioning)
confidence: 99%
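As an illustration of the hindsight information matching recipe the excerpt above describes (compute a context z from the future of each trajectory, then behavior-clone a z-conditioned policy), here is a small sketch that uses return-to-go as the hindsight context, in the spirit of return-conditioned supervised learning. The network shape and helper names are assumptions made for this sketch only.

```python
# Sketch of hindsight information matching with behavioral cloning: relabel a
# context z from future trajectory information, then fit pi(a_t | s_t, z) by
# supervised regression. Continuous actions are assumed for simplicity.
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    def __init__(self, state_dim, context_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, context):
        return self.net(torch.cat([state, context], dim=-1))

def hindsight_contexts(rewards):
    """Hindsight context z_t = return-to-go over tau_{t:T}, computed backward."""
    contexts, running = [], 0.0
    for r in reversed(rewards):
        running += r
        contexts.append([running])
    return list(reversed(contexts))

def bc_loss(policy, states, contexts, actions):
    """Behavioral cloning objective on relabeled (state, context, action) data."""
    return nn.functional.mse_loss(policy(states, contexts), actions)
```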
“…Using the tools of offline RL, it is possible to construct self-supervised RL methods that do not require any exploration on their own. Much like the "virtual play" mentioned in Section 2, we can utilize offline RL in combination with goal-conditioned policies to learn entirely from previously collected data [33,34,35]. However, major challenges remain.…”
Section: Offline Reinforcement Learning (mentioning)
confidence: 99%
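A minimal sketch of the "learn entirely from previously collected data" setting mentioned in the excerpt above, assuming logged trajectories of (state, action) pairs: goals are relabeled in hindsight from later states of the same trajectory, and a goal-conditioned policy can then be fit by plain supervised learning, with no exploration of its own. The dataset format and helper names are hypothetical.

```python
# Illustrative self-supervised, goal-conditioned learning from a fixed offline
# dataset: no environment interaction, only logged trajectories.
import random

def goal_conditioned_bc_dataset(offline_trajectories):
    """Turn logged trajectories into (state, goal, action) tuples by relabeling
    goals with states actually reached later in the same trajectory."""
    examples = []
    for traj in offline_trajectories:          # traj: list of (state, action)
        for t, (state, action) in enumerate(traj):
            future_state, _ = random.choice(traj[t:])
            examples.append((state, future_state, action))
    return examples

# Any supervised learner fit on these tuples yields a goal-reaching policy
# pi(a | s, g) without collecting new experience, in the spirit of the
# goal-conditioned offline methods the excerpt points to.
```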