2018
DOI: 10.48550/arxiv.1805.09793
Preprint
New Insights into Bootstrapping for Bandits

Cited by 7 publications (12 citation statements)
References: 0 publications
“…Following the arguments in Vaswani et al. (2018) and Kveton et al. (2018), in this section, we show that UCB with a naive bootstrapped confidence bound will result in linear regret in a two-armed Bernoulli bandit. At round t + 1, the UCB index without the correction term for arm k can be written as…”
Section: A Linear Regret (mentioning)
confidence: 86%
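As a rough illustration of the construction discussed in the statement above, the sketch below implements a naive bootstrapped UCB index (a high quantile of bootstrap-resampled means, with no correction term) on a two-armed Bernoulli bandit. The quantile level, number of bootstrap samples, horizon, and arm means are illustrative assumptions, not values from the cited papers.

```python
import numpy as np

# Minimal sketch, not the cited papers' exact construction: a "naive"
# bootstrapped UCB index for a two-armed Bernoulli bandit. The index of an
# arm is a high quantile of bootstrap-resampled means of its observed
# rewards, with no extra correction term.

def bootstrap_ucb_index(rewards, n_boot=200, quantile=0.95, rng=None):
    """High quantile of the bootstrap distribution of the sample mean."""
    rng = rng or np.random.default_rng()
    rewards = np.asarray(rewards, dtype=float)
    boot_means = rng.choice(rewards, size=(n_boot, len(rewards))).mean(axis=1)
    return np.quantile(boot_means, quantile)

def run(horizon=2000, means=(0.9, 0.8), seed=0):
    rng = np.random.default_rng(seed)
    rewards = [[], []]
    pulls = np.zeros(2, dtype=int)
    regret = 0.0
    for t in range(horizon):
        if t < 2:  # pull each arm once to initialise its history
            arm = t
        else:
            idx = [bootstrap_ucb_index(rewards[k], rng=rng) for k in range(2)]
            arm = int(np.argmax(idx))
        r = float(rng.random() < means[arm])
        rewards[arm].append(r)
        pulls[arm] += 1
        regret += max(means) - means[arm]
    return pulls, regret

if __name__ == "__main__":
    pulls, regret = run()
    print("pulls per arm:", pulls, "cumulative regret:", regret)
```

The failure mode behind the linear-regret argument is visible in this sketch: when an arm's observed rewards are all identical, the bootstrap distribution of its mean is a point mass, so the index carries no exploration bonus and an unlucky early draw can leave the optimal arm starved indefinitely.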
“…[11,31] use the bootstrap in the posterior distribution of Thompson sampling to improve computational efficiency. In addition, the bootstrap can be used to learn model coefficients in contextual bandits [40], achieve near-optimal regret [44], approximate Thompson sampling [15], and construct an algorithm that performs well across different models [23]. These papers apply the bootstrap as a component of the online algorithm, which differs from our work.…”
Section: Related Work (mentioning)
confidence: 99%
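The first use mentioned above, bootstrapping in place of an exact posterior for Thompson sampling, can be sketched as follows: each round, every arm's reward history is resampled with replacement and the arm with the highest resampled mean is played. The pseudo-rewards used to seed each history and all constants are assumptions for illustration, not the cited papers' exact schemes.

```python
import numpy as np

# Hedged sketch of bootstrap-based (approximate) Thompson sampling for a
# K-armed Bernoulli bandit: instead of sampling from an exact posterior, each
# round draws one bootstrap resample of every arm's reward history and acts
# greedily with respect to the resampled means.

def bootstrap_thompson_step(histories, rng):
    """Pick an arm by maximising the mean of one bootstrap resample per arm."""
    sampled_means = []
    for h in histories:
        h = np.asarray(h, dtype=float)
        sampled_means.append(rng.choice(h, size=len(h)).mean())
    return int(np.argmax(sampled_means))

def run(means=(0.3, 0.5, 0.7), horizon=1000, seed=0):
    rng = np.random.default_rng(seed)
    # one optimistic and one pessimistic pseudo-reward per arm avoid a
    # degenerate bootstrap distribution at the start (an assumed fix)
    histories = [[0.0, 1.0] for _ in means]
    total = 0.0
    for _ in range(horizon):
        arm = bootstrap_thompson_step(histories, rng)
        r = float(rng.random() < means[arm])
        histories[arm].append(r)
        total += r
    return total

if __name__ == "__main__":
    print("total reward:", run())
```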
“…The Boltzmann policy computes a softmax over the predicted rewards of candidate actions to derive a stochastic policy, which is also shown to be competitive in [26]. Bootstrap-based exploration has also been shown to be effective in both reinforcement learning [24,27] and bandit problems [6,7,20,31,32]; these methods either maintain multiple bootstrap samples of the history or train multiple reward models on different subsets of the data. They can be combined with deep neural networks and therefore achieve state-of-the-art performance in deep contextual bandits.…”
Section: Non-Bayesian Approaches (mentioning)
confidence: 99%
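A minimal sketch of the Boltzmann (softmax) policy described above: actions are sampled with probability proportional to exp(predicted reward / temperature). The temperature value and the toy reward predictions are assumptions, purely for illustration.

```python
import numpy as np

# Minimal sketch of a Boltzmann (softmax) exploration policy: candidate
# actions are sampled in proportion to exp(predicted_reward / temperature).

def boltzmann_policy(predicted_rewards, temperature=0.1, rng=None):
    rng = rng or np.random.default_rng()
    z = np.asarray(predicted_rewards, dtype=float) / temperature
    z -= z.max()                                   # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs)), probs

if __name__ == "__main__":
    action, probs = boltzmann_policy([0.2, 0.5, 0.45])
    print("sampled action:", action, "policy:", probs.round(3))
```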
“…Osband and Van Roy [25] proposed a bandit algorithm named BootstrapThompson and showed that the algorithm approximates Thompson sampling in Bernoulli bandits. Vaswani et al. [32] generalized it to categorical and Gaussian rewards. Hao et al. [13] extended UCB with the multiplier bootstrap and derived both problem-dependent and problem-independent regret bounds for the proposed algorithm.…”
Section: Other Related Work (mentioning)
confidence: 99%
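A hedged sketch of a multiplier-bootstrap UCB index in the spirit of the statement above: the exploration bonus is a high quantile of the centred sample mean perturbed by i.i.d. Gaussian multiplier weights. The weight distribution and quantile level are assumptions, and any correction term used in the cited algorithm is omitted here for brevity.

```python
import numpy as np

# Hedged sketch: multiplier-bootstrap UCB index for one arm. The bonus is a
# high quantile of the multiplier-bootstrap distribution of the centred
# sample mean, using i.i.d. standard-Gaussian weights (an assumed choice).

def multiplier_bootstrap_ucb(rewards, n_boot=500, quantile=0.95, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(rewards, dtype=float)
    centred = x - x.mean()
    weights = rng.standard_normal((n_boot, len(x)))  # multiplier weights
    boot_stats = (weights * centred).mean(axis=1)    # perturbed centred means
    return x.mean() + np.quantile(boot_stats, quantile)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    obs = rng.binomial(1, 0.6, size=30).astype(float)
    print("bootstrapped index:", multiplier_bootstrap_ucb(obs, rng=rng))
```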