2021
DOI: 10.48550/arxiv.2104.07749
Preprint

Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills

Cited by 20 publications (44 citation statements) | References 0 publications

“…In the preceding deterministic MDP formulation, we aim at solving a goal-reaching RL problem (Kaelbling, 1993b; Sutton et al., 2011; Andrychowicz et al., 2017; Andreas et al., 2017; Pong et al., 2018; Ghosh et al., 2019; Eysenbach et al., 2020a, 2020b; Kadian et al., 2020; Fujita et al., 2020; Chebotar et al., 2021; Khazatsky et al., 2021) or a planning problem (Bertsekas & Tsitsiklis, 1996; Boutilier et al., 1999; Sutton et al., 1999; Boutilier et al., 2000; Rintanen & Hoffmann, 2001; LaValle, 2006; Russell & Norvig, 2009; Nasiriany et al., 2019). We say a Q-function is successful if its associated greedy policy (Sutton & Barto, 2018)…”
Section: Successful Q-functions (mentioning)
confidence: 99%
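To make the goal-conditioned setup in the excerpt above concrete, here is a minimal sketch (not taken from the cited paper) of a tabular goal-conditioned Q-function and the greedy policy it induces in a deterministic goal-reaching MDP; all names are illustrative placeholders, and the excerpt is cut off before the paper's formal definition of a "successful" Q-function, so the sketch only shows the greedy-policy construction it refers to.

```python
# Illustrative sketch: a tabular goal-conditioned Q-function Q(s, a, g) and
# the greedy policy it induces. Names and structure are assumptions, not code
# from the cited paper.
from collections import defaultdict
import random

ACTIONS = ["up", "down", "left", "right"]

# Q(s, a, g): estimated value of action a in state s when the target is goal g.
q_table = defaultdict(float)  # keyed by (state, action, goal)

def greedy_action(state, goal):
    """Greedy policy associated with the Q-function: pick the action with the
    highest goal-conditioned value, breaking ties at random."""
    values = [q_table[(state, a, goal)] for a in ACTIONS]
    best = max(values)
    return random.choice([a for a, v in zip(ACTIONS, values) if v == best])
```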
“…"relabel"-able, like goal reaching rewards, and (2) if policy or Q-functions are smooth with respect to the reward parameter, generalization can speed up learning even with respect to "unexplored" rewards. In goal-based RL where future states can inform "optimal" reward parameters with respect to the transitions' actions, hindsight methods were applied successfully to enable effective training of goal-based Q-function for sparse rewards , derive exact connections between Q-learning and classic model-based RL , dataefficient off-policy hierarchical RL (Nachum et al, 2018), multi-task RL Li et al, 2020), offline RL (Chebotar et al, 2021), and more Choi et al, 2021;Ren et al, 2019;Zhao & Tresp, 2018;Ghosh et al, 2021;Nasiriany et al, 2021). Additionally, Lynch et al (2019) and Gupta et al (2018) have shown that often BC is sufficient for learning generalizable parameterized policies, due to rich positive examples from future states, and most recently Chen et al (2021a) and Janner et al (2021), when combined with powerful transformer architectures (Vaswani et al, 2017), it produced state-of-the-art offline RL and goal-based RL results.…”
Section: Related Workmentioning
confidence: 99%
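The hindsight relabeling idea referenced in the excerpt above can be sketched in a few lines. This is an illustrative HER-style relabeling routine under assumed data structures (the `Transition` record and function names are hypothetical), not code from any of the cited works: each transition is duplicated with goals taken from states actually reached later in the same trajectory, so a sparse goal-reaching reward becomes frequently nonzero.

```python
# Illustrative hindsight relabeling for sparse-reward, goal-based Q-learning.
import random
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple
    goal: tuple
    reward: float

def sparse_reward(next_state, goal):
    # Sparse goal-reaching reward: 1 only when the commanded goal is achieved.
    return 1.0 if next_state == goal else 0.0

def relabel_with_future_goals(trajectory, k=4):
    """For each transition, also store copies whose goal is a state reached
    later in the same trajectory, so Q-learning gets a useful learning signal."""
    out = []
    for t, tr in enumerate(trajectory):
        out.append(tr)  # keep the original (commanded-goal) transition
        future = trajectory[t:]
        for _ in range(min(k, len(future))):
            new_goal = random.choice(future).next_state
            out.append(Transition(tr.state, tr.action, tr.next_state,
                                  new_goal, sparse_reward(tr.next_state, new_goal)))
    return out
```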
“…Orthogonal to these, in recent years we have seen a number of algorithms that are derived from different motivations and frameworks but share the following common trait: they use future trajectory information τ_{t:T} to accelerate optimization of a contextual policy π(a_t | s_t, z) with context z with respect to a parameterized reward function r(s_t, a_t, z) (see Section 3 for notation). These hindsight algorithms have enabled Q-learning with sparse rewards, temporally extended model-based RL with Q-functions, mastery of 6-DoF object manipulation in cluttered scenes from human play (Lynch et al., 2019), efficient multi-task RL (Li et al., 2020), offline self-supervised discovery of manipulation primitives from pixels (Chebotar et al., 2021), and offline RL using return-conditioned supervised learning with transformers (Chen et al., 2021a; Janner et al., 2021). We derive a generic problem formulation covering all these variants, and observe that this hindsight information matching (HIM) framework, with behavioral cloning (BC) as the learning objective, can learn a conditional policy to generate trajectories that satisfy given properties, including distributional ones.…”
Section: Introduction (mentioning)
confidence: 99%
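As an illustration of the hindsight information matching recipe the excerpt above describes (compute a context z from the future of each trajectory, then behavior-clone a z-conditioned policy), here is a small sketch that uses return-to-go as the hindsight context, in the spirit of return-conditioned supervised learning. The network shape and helper names are assumptions made for this sketch only.

```python
# Sketch of hindsight information matching with behavioral cloning: relabel a
# context z from future trajectory information, then fit pi(a_t | s_t, z) by
# supervised regression. Continuous actions are assumed for simplicity.
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    def __init__(self, state_dim, context_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, context):
        return self.net(torch.cat([state, context], dim=-1))

def hindsight_contexts(rewards):
    """Hindsight context z_t = return-to-go over tau_{t:T}, computed backward."""
    contexts, running = [], 0.0
    for r in reversed(rewards):
        running += r
        contexts.append([running])
    return list(reversed(contexts))

def bc_loss(policy, states, contexts, actions):
    """Behavioral cloning objective on relabeled (state, context, action) data."""
    return nn.functional.mse_loss(policy(states, contexts), actions)
```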
“…Using the tools of offline RL, it is possible to construct self-supervised RL methods that do not require any exploration on their own. Much like the "virtual play" mentioned in Section 2, we can utilize offline RL in combination with goal-conditioned policies to learn entirely from previously collected data [33,34,35]. However, major challenges remain.…”
Section: Offline Reinforcement Learning (mentioning)
confidence: 99%
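A minimal sketch of the "learn entirely from previously collected data" setting mentioned in the excerpt above, assuming logged trajectories of (state, action) pairs: goals are relabeled in hindsight from later states of the same trajectory, and a goal-conditioned policy can then be fit by plain supervised learning, with no exploration of its own. The dataset format and helper names are hypothetical.

```python
# Illustrative self-supervised, goal-conditioned learning from a fixed offline
# dataset: no environment interaction, only logged trajectories.
import random

def goal_conditioned_bc_dataset(offline_trajectories):
    """Turn logged trajectories into (state, goal, action) tuples by relabeling
    goals with states actually reached later in the same trajectory."""
    examples = []
    for traj in offline_trajectories:          # traj: list of (state, action)
        for t, (state, action) in enumerate(traj):
            future_state, _ = random.choice(traj[t:])
            examples.append((state, future_state, action))
    return examples

# Any supervised learner fit on these tuples yields a goal-reaching policy
# pi(a | s, g) without collecting new experience, in the spirit of the
# goal-conditioned offline methods the excerpt points to.
```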