“…However, reward binds the agent to a certain task for which the reward represents success. Aligned with the recent surge of interest in unsupervised methods in reinforcement learning (Baranes and Oudeyer, 2013; Bellemare et al., 2016; Gregor et al., 2016; Houthooft et al., 2016; Gupta et al., 2018; Hausman et al., 2018; Pong et al., 2019; Laskin et al., 2020, 2021; He et al., 2021) and previously proposed ideas (Schmidhuber, 1991a, 2010), we argue that there exist properties of a dynamical system that are not tied to any particular task yet are highly useful; leveraging them can help solve other tasks more efficiently. This work focuses on the sensitivity of the trajectories produced by the system with respect to the policy, so-called Physical Derivatives.…”
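The excerpt defines Physical Derivatives only abstractly: the sensitivity of a system's trajectories with respect to its policy. A minimal sketch of that idea is a central finite-difference estimate of how a rollout changes as a policy parameter is perturbed. Everything here is illustrative and not from the source: the toy point-mass dynamics, the linear feedback policy, and the names `rollout`, `physical_derivative`, and `theta` are all assumptions.

```python
def rollout(theta, T=50, dt=0.1):
    """Simulate a hypothetical 1-D point mass under a linear feedback
    policy u = theta * x, returning the position trajectory.
    (Toy dynamics for illustration; not the paper's system.)"""
    x, v = 1.0, 0.0
    traj = []
    for _ in range(T):
        u = theta * x          # linear feedback policy
        v += dt * u            # Euler step for velocity
        x += dt * v            # Euler step for position
        traj.append(x)
    return traj

def physical_derivative(theta, eps=1e-4):
    """Central finite-difference sensitivity of the whole trajectory
    with respect to the scalar policy parameter theta."""
    plus = rollout(theta + eps)
    minus = rollout(theta - eps)
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

d = physical_derivative(-1.0)  # one sensitivity value per time step
```

Each entry of `d` approximates how the state at that time step would shift under an infinitesimal change of the policy; a richer policy would make `theta` a vector and `d` a Jacobian.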