2017 | Preprint
DOI: 10.48550/arxiv.1712.06924

Safe Policy Improvement with Baseline Bootstrapping

Cited by 7 publications (9 citation statements)
References 0 publications

“…On the other hand, for model-free methods, Sutton and Barto [23] identify a deadly triad of function approximation, bootstrapping, and off-policy learning. It emphasizes that function approximation equipped with Q-learning can diverge in the off-policy learning setting, making most off-policy RL methods very conservative in extrapolation because of the severe error-propagation issue [7,16]. In this work, we focus on the contextual bandit problem, which has many direct applications in recommender systems and online search.…”
Section: Related Work (mentioning)
confidence: 99%
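The deadly triad named in this citation can be made concrete with a short, purely illustrative sketch (not code from any of the cited papers): a semi-gradient Q-learning update with linear function approximation applied to off-policy transitions, i.e. all three ingredients at once. The feature map phi and every variable name below are assumptions made for illustration only.

```python
import numpy as np

def q_learning_update(w, phi, s, a, r, s_next, actions, gamma=0.99, lr=0.1):
    """One off-policy, semi-gradient TD update of linear weights w.

    phi(s, a) is an assumed feature map returning a NumPy vector.
    """
    q_sa = w @ phi(s, a)
    # Bootstrapping: the target reuses the current estimate at the next state.
    target = r + gamma * max(w @ phi(s_next, b) for b in actions)
    # Semi-gradient step: the target is treated as a constant.
    return w + lr * (target - q_sa) * phi(s, a)

# Toy usage with a one-hot feature per (state, action) pair: 2 states x 2 actions.
phi = lambda s, a: np.eye(4)[2 * s + a]
w = np.zeros(4)
w = q_learning_update(w, phi, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```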
“…For each fixed (u 1 , u 2 ) pair, the inner minimization objective is ERM with IPS where the reward is shifted by k = (u 1 − u 2 ). So, we can select k, solve (16), and compute the corresponding κ afterwards.…”
Section: Safe Learning By Restricting the Policy Space (mentioning)
confidence: 99%
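As a rough illustration of the inner objective described in this citation, the sketch below computes an inverse-propensity-scored (IPS) estimate of a reward shifted by a constant k on logged bandit data. The function and argument names are hypothetical; equation (16) and the quantities u1, u2 and κ belong to the citing paper and are not reproduced here.

```python
import numpy as np

def shifted_ips_estimate(pi_probs, mu_probs, rewards, k):
    """IPS estimate of the k-shifted reward under the target policy.

    pi_probs, mu_probs: target / logging policy probabilities of the logged
    actions; rewards: logged rewards.  All are 1-D NumPy arrays.
    """
    weights = pi_probs / mu_probs        # importance weights pi(a|x) / mu(a|x)
    return np.mean(weights * (rewards - k))

# Toy usage on a hand-made log of three interactions.
estimate = shifted_ips_estimate(
    pi_probs=np.array([0.9, 0.2, 0.5]),
    mu_probs=np.array([0.5, 0.5, 0.5]),
    rewards=np.array([1.0, 0.0, 1.0]),
    k=0.1,
)
```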
“…These failure cases are hypothesized to be caused by erroneous generalization of the state-action value function (Q-value function) learned with function approximators, as suggested by Sutton (1995); Baird (1995); Tsitsiklis & Van Roy (1997); Van Hasselt et al (2018). To remedy this issue, two types of approaches have been proposed recently: 1) Agarwal et al (2019) propose to apply a random ensemble of Q-value targets to stabilize the learned Q-function; 2) Fujimoto et al (2018a); Jaques et al (2019); Laroche & Trichelair (2017) propose to regularize the learned policy towards the behavior policy, based on the intuition that unseen state-action pairs are more likely to receive overestimated Q-values. These proposed remedies have been shown to improve upon DQN or DDPG at performing policy improvement based on offline data.…”
Section: Introduction (mentioning)
confidence: 99%
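The second family of remedies mentioned in this citation, regularizing the learned policy towards the behavior policy, can be sketched as a generic actor loss. This is an illustrative formulation under stated assumptions, not any specific cited algorithm: policy and behavior_policy are assumed to map a batch of states to torch.distributions objects, and q_net to return Q-value estimates.

```python
import torch

def regularized_actor_loss(q_net, policy, behavior_policy, states, alpha=1.0):
    """Maximize the learned Q-value while staying close to the behavior policy."""
    dist = policy(states)
    actions = dist.rsample()              # reparameterized action sample
    q_values = q_net(states, actions)
    # KL penalty keeps the learned policy near the (estimated) behavior policy,
    # so actions with unreliable, often overestimated Q-values are rarely chosen.
    kl = torch.distributions.kl_divergence(dist, behavior_policy(states))
    return (-q_values + alpha * kl).mean()
```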
“…Model-free methods constitute the majority of offline RL algorithms in the current RL literature. Among the first algorithms to consider the problem now known as offline RL - with no environment interaction, learning only from a static dataset collected under a baseline policy - was Safe Policy Improvement with Baseline Bootstrapping (SPIBB) (Laroche et al, 2017). It was designed for discrete actions and assumes that the baseline policy is passed to the learning algorithm as an input.…”
Section: Related Work (mentioning)
confidence: 99%
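A minimal sketch of the SPIBB constraint described here, under the simplifying assumption of a tabular problem with explicit state-action counts: the new policy copies the baseline wherever a state-action pair occurs fewer than n_min times in the dataset, and moves the remaining baseline probability mass to the best-estimated remaining action. This is a simplification for illustration, not the authors' reference implementation, and n_min stands in for the count threshold used in the paper.

```python
import numpy as np

def spibb_policy(q, baseline, counts, n_min):
    """q, baseline, counts: arrays of shape (n_states, n_actions)."""
    pi = np.zeros_like(baseline)
    for s in range(baseline.shape[0]):
        bootstrapped = counts[s] < n_min                 # poorly supported actions
        pi[s, bootstrapped] = baseline[s, bootstrapped]  # keep the baseline there
        free_mass = baseline[s, ~bootstrapped].sum()
        if free_mass > 0:
            # Move the remaining mass to the best well-estimated action.
            safe_actions = np.where(~bootstrapped)[0]
            best = safe_actions[np.argmax(q[s, safe_actions])]
            pi[s, best] += free_mass
    return pi

# Toy usage: 2 states, 3 actions, threshold of 5 visits per state-action pair.
q = np.array([[1.0, 2.0, 0.0], [0.5, 0.1, 0.9]])
baseline = np.full((2, 3), 1.0 / 3.0)
counts = np.array([[10, 2, 8], [6, 6, 1]])
new_policy = spibb_policy(q, baseline, counts, n_min=5)
```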