2021
DOI: 10.48550/arxiv.2102.09225
Preprint

Continuous Doubly Constrained Batch Reinforcement Learning

Abstract: Reliant on too many experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, which can be too expensive to allow exploration. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data…
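
To make the abstract's contrast concrete, here is a minimal sketch (illustrative assumptions only, not the paper's CDC algorithm) of the batch/offline training loop it describes: learning consumes a fixed logged dataset and never queries the environment. The `agent.update` interface and the transition-tuple layout are hypothetical.

```python
# Illustrative sketch: batch/offline RL trains only on a fixed logged dataset,
# with no calls to env.step() during learning (hypothetical agent interface).
import random

def train_offline(agent, dataset, n_updates=100_000, batch_size=256):
    """`dataset`: list of (state, action, reward, next_state, done) tuples."""
    for _ in range(n_updates):
        batch = random.sample(dataset, batch_size)  # reuse logged experience only
        agent.update(batch)                         # e.g. one actor-critic gradient step
    return agent
```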

Cited by 3 publications (4 citation statements)
References 24 publications (37 reference statements)
“…In practice, we can use the KL divergence to replace the total variation distance between policies, based on Pinsker's inequality: $\|\pi_1 - \pi_2\| \le \sqrt{2 D_{\mathrm{KL}}(\pi_1 \| \pi_2)}$. Moreover, since the behavior policy $\pi_{\beta,n}$ is typically unknown, we can use the reverse KL divergence between $\pi_n$ and $\pi_{\beta,n}$ to circumvent the estimation of $\pi_{\beta,n}$, following the same line as in (Fakoor et al., 2021):…”
Section: Safe Policy Improvement with Meta-regularization
confidence: 99%
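
To make the quoted trick concrete, below is a minimal sketch, under assumed interfaces, of a KL-style behavior regularizer that never fits $\pi_{\beta,n}$: with the expectation taken over logged actions, the behavior policy's own log-density is constant with respect to the learned policy, so only a negative log-likelihood term needs to be optimized. The `policy` callable returning a `torch.distributions` object (diagonal Gaussian over action dimensions) is an assumption; the exact KL direction and weighting follow the cited papers and may differ.

```python
# Minimal sketch (not the cited papers' exact code): the trainable part of a
# KL-to-behavior regularizer, estimated purely from logged (state, action) pairs.
import torch

def behavior_kl_penalty(policy, states, actions):
    """Return -E_D[log pi(a|s)]; the dropped E_D[log pi_beta(a|s)] term is constant in `policy`."""
    dist = policy(states)                       # assumed: returns a torch.distributions object
    log_probs = dist.log_prob(actions).sum(-1)  # log pi(a|s) of the logged actions
    return -log_probs.mean()                    # smaller value <=> policy stays near the data
```
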
“…Insufficient coverage of the dataset due to the lack of online exploration is known as the main challenge in offline RL. To deal with this problem, a number of methods have recently been proposed from both model-free (Wu et al., 2019; Touati et al., 2020; Liu et al., 2020; Rezaeifar et al., 2021; Fujimoto et al., 2019; Fakoor et al., 2021) and model-based perspectives (Yu et al., 2020; Kidambi et al., 2020; Matsushima et al., 2020; Yin et al., 2021). To a greater or lesser extent, these methods rely on the idea of pessimism and its variants, in the sense that the learned policy avoids uncertain regions not covered by the offline data.…”
Section: Related Work
confidence: 99%
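
As a hedged illustration of the pessimism idea summarized above (not any single cited algorithm), one common recipe subtracts an epistemic-uncertainty estimate, here ensemble disagreement, from the value used for policy improvement:

```python
# Illustrative sketch: pessimistic value estimate via Q-ensemble disagreement.
import torch

def pessimistic_q(q_ensemble, states, actions, beta=1.0):
    """Mean ensemble prediction minus a disagreement penalty (larger beta = more pessimistic)."""
    qs = torch.stack([q(states, actions) for q in q_ensemble])  # shape [K, batch]
    return qs.mean(dim=0) - beta * qs.std(dim=0)                # down-weight poorly covered pairs
```
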
“…Offline RL: In offline RL, algorithms such as FQI (Ernst et al., 2005) have finite-sample error guarantees under the global coverage assumption (Antos et al., 2008). Recently, many algorithms tackling this problem have been proposed from both model-free (Wu et al., 2019; Touati et al., 2020; Liu et al., 2020; Fujimoto et al., 2019; Fakoor et al., 2021; Kumar et al., 2020) and model-based perspectives (Kidambi et al., 2020; Matsushima et al., 2020), building on some form of pessimism. The idea of pessimism features in offline RL as a way to penalize the learner for visiting unknown regions of the state-action space (Rashidinejad et al., 2021; Yin et al., 2021; Buckman et al., 2020).…”
Section: Related Work
confidence: 99%
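
For reference, here is a compact Fitted Q-Iteration loop in the spirit of Ernst et al. (2005), which the quote cites for its finite-sample guarantees under global coverage: each round regresses the Bellman backup of the current Q-estimate on the fixed dataset. The discrete-action setup, the tree regressor, and all hyperparameters are illustrative assumptions, not details from the cited work.

```python
# Sketch of Fitted Q-Iteration on a fixed dataset of logged transitions
# (terminal-state masking omitted for brevity; discrete actions assumed).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fqi(s, a, r, s_next, n_actions, gamma=0.99, n_iters=50):
    """s: [N, d] states, a: [N] actions, r: [N] rewards, s_next: [N, d] next states."""
    X = np.column_stack([s, a])
    q = None
    for _ in range(n_iters):
        if q is None:
            targets = r                                   # first round regresses the reward
        else:
            # Bellman backup r + gamma * max_a' Q(s', a'), evaluated only on logged data
            next_q = np.column_stack([
                q.predict(np.column_stack([s_next, np.full(len(r), a_prime)]))
                for a_prime in range(n_actions)
            ])
            targets = r + gamma * next_q.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q
```
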
“…The first alternative is imposing constraints on the policy class or Q-function class so that estimated policies do not stray too far from the behavior policy. For example, we can use KL divergences, the MMD distance, or the Wasserstein distance to measure the distance from the behavior policy (Wu et al., 2019; Fakoor et al., 2021; Matsushima et al., 2020; Touati et al., 2020; Fujimoto et al., 2019) and add $D(\pi, \pi_b)$ as a penalty term, where $\pi_b$ is the behavior policy. Another way is explicitly estimating a lower bound on the Q-functions (Kumar et al., 2020).…”
Section: Now We Only Need to Focus on Analyzing
confidence: 99%
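
A hedged sketch of the generic pattern the quote describes: the actor trades the critic's value against $\lambda \cdot D(\pi, \pi_b)$, with the divergence estimator (KL, MMD, Wasserstein, etc.) passed in as a callable. Every interface name here is assumed for illustration, not taken from any cited implementation.

```python
# Illustrative sketch: distance-to-behavior penalty added to an actor objective.
import torch

def constrained_actor_loss(policy, critic, divergence, states, dataset_actions, lam=1.0):
    """Loss = -E[Q(s, pi(s))] + lam * D(pi(.|s), behavior), averaged over the batch."""
    dist = policy(states)                            # assumed: returns a torch.distributions object
    new_actions = dist.rsample()                     # reparameterized actions from the learned policy
    value_term = critic(states, new_actions).mean()  # exploit the learned Q-function
    penalty = divergence(dist, dataset_actions)      # hypothetical estimator of D(pi, pi_b)
    return -value_term + lam * penalty
```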