2020
DOI: 10.48550/arxiv.2011.07213
Preprint

PLAS: Latent Action Space for Offline Reinforcement Learning

Abstract: The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting will be an increasingly important paradigm for real-world applications of reinforcement learning such as robotics, in which data collection is slow and potentially dangerous. Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions. This leads to the challenge of constraining the …
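The title points to the central idea: acting in the latent space of a generative model fit to the offline dataset rather than in the raw action space. Below is a minimal, hypothetical PyTorch sketch of that general idea, assuming a conditional-VAE decoder pretrained on the dataset; the class names, network sizes, and the latent clipping range are illustrative assumptions, not the paper's reference implementation.

```python
# Hypothetical sketch (not the paper's reference code): a policy that acts in the
# latent space of a CVAE decoder trained on the offline dataset, so that decoded
# actions stay close to the data distribution.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """CVAE decoder p(a | s, z); assumed to be pretrained on the offline dataset."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=256, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state, z):
        return self.max_action * self.net(torch.cat([state, z], dim=-1))

class LatentPolicy(nn.Module):
    """Deterministic policy that outputs a latent code instead of a raw action."""
    def __init__(self, state_dim, latent_dim, hidden=256, max_latent=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim), nn.Tanh(),
        )
        # Clipping keeps the latent in a high-density region of the CVAE prior.
        self.max_latent = max_latent

    def forward(self, state):
        return self.max_latent * self.net(state)

def select_action(policy, decoder, state):
    """Act by decoding the policy's latent output through the frozen decoder."""
    with torch.no_grad():
        z = policy(state)
        return decoder(state, z)
```

A Q-function and this latent policy can then be trained with standard off-policy updates, with the frozen decoder keeping every executed action near the data distribution.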

Cited by 11 publications (10 citation statements)
References 13 publications (26 reference statements)
“…The major challenge in offline RL is distribution shift [20,37,39], where the learned policy might generate out-of-distribution actions, resulting in erroneous value backups. Prior offline RL methods address this issue by regularizing the learned policy to be "close" to the behavior policy [20,50,29,82,93,37,68,57], through variants of importance sampling [59,74,49,75,54], via uncertainty quantification on Q-values [2,37,82,43], by learning conservative Q-functions [39,36], and with model-based training with a penalty on out-of-distribution states [34,91,53,4,76,60,42,92]. While current benchmarks in offline RL [19,25] contain datasets that involve multi-task structure, existing offline RL methods do not leverage the shared structure of multiple tasks and instead train each individual task from scratch.…”
Section: Related Work
confidence: 99%
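As a concrete illustration of the first family mentioned in this statement (regularizing the learned policy toward the behavior policy), here is a hypothetical actor objective with a behavior-cloning penalty, in the spirit of those methods but not the exact objective of any cited work; the names q_net, policy, and the bc_weight value are assumptions.

```python
# Illustrative only: keep the learned policy "close" to the behavior policy by adding
# a behavior-cloning penalty to the actor objective (coefficients are assumptions).
import torch

def actor_loss(q_net, policy, states, dataset_actions, bc_weight=2.5):
    """Maximize Q under the policy while penalizing deviation from dataset actions."""
    pi_actions = policy(states)
    q_values = q_net(states, pi_actions)
    # Normalizing by the average |Q| keeps the two terms on a comparable scale.
    lam = bc_weight / q_values.abs().mean().detach()
    bc_penalty = ((pi_actions - dataset_actions) ** 2).mean()
    return -lam * q_values.mean() + bc_penalty
```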
“…The main challenge of offline RL is the distributional shift between the learned policy and the behavior policy (Fujimoto et al., 2018), which can cause erroneous value backups. To address this issue, prior methods have constrained the learned policy to be close to the behavior policy via policy regularization (Liu et al., 2020; Wu et al., 2019; Kumar et al., 2019; Zhou et al., 2020; Ghasemipour et al., 2021; Fujimoto & Gu, 2021), conservative value functions (Kumar et al., 2020), and model-based training with conservative penalties (Yu et al., 2020c; Kidambi et al., 2020; Swazinna et al., 2020; Lee et al., 2021; Yu et al., 2021b). Unlike these prior works, we study how unlabeled data can be incorporated into the offline RL framework.…”
Section: Related Work
confidence: 99%
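For the "conservative value functions" direction mentioned above, a hedged sketch of the kind of regularizer involved: it lowers Q-estimates on actions proposed by the current policy and raises them on dataset actions, and would be added on top of a standard Bellman loss. The weight alpha and the single-sample scheme are simplifying assumptions, not the exact formulation of the cited work.

```python
# Illustrative only: a conservative-Q-style regularizer that pushes Q-values down on
# actions sampled from the current policy and up on dataset actions.
import torch

def conservative_penalty(q_net, policy, states, dataset_actions, alpha=1.0):
    """Penalize Q on (potentially out-of-distribution) policy actions relative to data actions."""
    with torch.no_grad():
        pi_actions = policy(states)        # gradients flow only through q_net here
    q_pi = q_net(states, pi_actions)       # Q on policy actions (pushed down)
    q_data = q_net(states, dataset_actions)  # Q on dataset actions (pushed up)
    return alpha * (q_pi.mean() - q_data.mean())
```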
“…The use of generative models in robot learning has become popular in recent years [2, 3, 11-13, 42-52] because of their low-dimensional and regularized latent spaces. However, latent variable generative models are mainly studied to train a long-term state prediction model that is used in the context of trajectory optimization and model-based reinforcement learning [47-52].…”
Section: Related Work
confidence: 99%
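A minimal sketch of the kind of low-dimensional, regularized latent space this statement refers to: a conditional VAE trained on state-action pairs from a dataset. The architecture, latent dimensionality, and beta weight are assumptions for illustration, not taken from any cited paper.

```python
# Illustrative only: a conditional VAE over dataset actions; its regularized latent
# space is what latent-action methods act in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=2, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, action):
        mu, log_var = self.encoder(torch.cat([state, action], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization trick
        recon = self.decoder(torch.cat([state, z], -1))
        return recon, mu, log_var

def cvae_loss(model, state, action, beta=0.5):
    """Reconstruction loss plus a KL term toward a unit Gaussian prior."""
    recon, mu, log_var = model(state, action)
    recon_loss = F.mse_loss(recon, action)
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()
    return recon_loss + beta * kl
```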