2018
DOI: 10.48550/arxiv.1805.09793
Preprint
New Insights into Bootstrapping for Bandits

Cited by 7 publications (12 citation statements)
References: 0 publications
“…Following the arguments in Vaswani et al. (2018) and Kveton et al. (2018), in this section, we show that UCB with a naive bootstrapped confidence bound will result in linear regret in a two-armed Bernoulli bandit. At round t + 1, the UCB index without the correction term for arm k can be written as…”
Section: A Linear Regret (mentioning)
confidence: 86%
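As a rough illustration of the construction discussed in the statement above, the sketch below implements a naive bootstrapped UCB index (a high quantile of bootstrap-resampled means, with no correction term) on a two-armed Bernoulli bandit. The quantile level, number of bootstrap samples, horizon, and arm means are illustrative assumptions, not values from the cited papers.

```python
import numpy as np

# Minimal sketch, not the cited papers' exact construction: a "naive"
# bootstrapped UCB index for a two-armed Bernoulli bandit. The index of an
# arm is a high quantile of bootstrap-resampled means of its observed
# rewards, with no extra correction term.

def bootstrap_ucb_index(rewards, n_boot=200, quantile=0.95, rng=None):
    """High quantile of the bootstrap distribution of the sample mean."""
    rng = rng or np.random.default_rng()
    rewards = np.asarray(rewards, dtype=float)
    boot_means = rng.choice(rewards, size=(n_boot, len(rewards))).mean(axis=1)
    return np.quantile(boot_means, quantile)

def run(horizon=2000, means=(0.9, 0.8), seed=0):
    rng = np.random.default_rng(seed)
    rewards = [[], []]
    pulls = np.zeros(2, dtype=int)
    regret = 0.0
    for t in range(horizon):
        if t < 2:  # pull each arm once to initialise its history
            arm = t
        else:
            idx = [bootstrap_ucb_index(rewards[k], rng=rng) for k in range(2)]
            arm = int(np.argmax(idx))
        r = float(rng.random() < means[arm])
        rewards[arm].append(r)
        pulls[arm] += 1
        regret += max(means) - means[arm]
    return pulls, regret

if __name__ == "__main__":
    pulls, regret = run()
    print("pulls per arm:", pulls, "cumulative regret:", regret)
```

The failure mode behind the linear-regret argument is visible in this sketch: when an arm's observed rewards are all identical, the bootstrap distribution of its mean is a point mass, so the index carries no exploration bonus and an unlucky early draw can leave the optimal arm starved indefinitely.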
“…[11,31] use the bootstrap in the posterior distribution of Thompson sampling to improve computational efficiency. In addition, the bootstrap can be used to learn model coefficients in contextual bandits [40], achieve near-optimal regret [44], approximate Thompson sampling [15], and construct an algorithm that performs well across different models [23]. These papers apply the bootstrap as a component of the online algorithm, which differs from our work.…”
Section: Related Work (mentioning)
confidence: 99%
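The first use mentioned above, bootstrapping in place of an exact posterior for Thompson sampling, can be sketched as follows: each round, every arm's reward history is resampled with replacement and the arm with the highest resampled mean is played. The pseudo-rewards used to seed each history and all constants are assumptions for illustration, not the cited papers' exact schemes.

```python
import numpy as np

# Hedged sketch of bootstrap-based (approximate) Thompson sampling for a
# K-armed Bernoulli bandit: instead of sampling from an exact posterior, each
# round draws one bootstrap resample of every arm's reward history and acts
# greedily with respect to the resampled means.

def bootstrap_thompson_step(histories, rng):
    """Pick an arm by maximising the mean of one bootstrap resample per arm."""
    sampled_means = []
    for h in histories:
        h = np.asarray(h, dtype=float)
        sampled_means.append(rng.choice(h, size=len(h)).mean())
    return int(np.argmax(sampled_means))

def run(means=(0.3, 0.5, 0.7), horizon=1000, seed=0):
    rng = np.random.default_rng(seed)
    # one optimistic and one pessimistic pseudo-reward per arm avoid a
    # degenerate bootstrap distribution at the start (an assumed fix)
    histories = [[0.0, 1.0] for _ in means]
    total = 0.0
    for _ in range(horizon):
        arm = bootstrap_thompson_step(histories, rng)
        r = float(rng.random() < means[arm])
        histories[arm].append(r)
        total += r
    return total

if __name__ == "__main__":
    print("total reward:", run())
```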
“…The Boltzmann policy computes a softmax over the predicted rewards of candidate actions to derive a stochastic policy, which is also shown to be competitive in [26]. Bootstrap-based exploration has also been shown to be effective in both reinforcement learning [24,27] and bandit problems [6,7,20,31,32]; these methods either maintain multiple bootstrap samples of the history or train multiple reward models on different subsets of the data. They can be combined with deep neural networks and therefore achieve state-of-the-art performance in deep contextual bandits.…”
Section: Non-Bayesian Approaches (mentioning)
confidence: 99%
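A minimal sketch of the Boltzmann (softmax) policy described above: actions are sampled with probability proportional to exp(predicted reward / temperature). The temperature value and the toy reward predictions are assumptions, purely for illustration.

```python
import numpy as np

# Minimal sketch of a Boltzmann (softmax) exploration policy: candidate
# actions are sampled in proportion to exp(predicted_reward / temperature).

def boltzmann_policy(predicted_rewards, temperature=0.1, rng=None):
    rng = rng or np.random.default_rng()
    z = np.asarray(predicted_rewards, dtype=float) / temperature
    z -= z.max()                                   # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs)), probs

if __name__ == "__main__":
    action, probs = boltzmann_policy([0.2, 0.5, 0.45])
    print("sampled action:", action, "policy:", probs.round(3))
```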
“…Osband and Van Roy [25] proposed a bandit algorithm named BootstrapThompson and showed that the algorithm approximates Thompson sampling in Bernoulli bandits. Vaswani et al. [32] generalized it to categorical and Gaussian rewards. Hao et al. [13] extended UCB with the multiplier bootstrap and derived both problem-dependent and problem-independent regret bounds for the proposed algorithm.…”
Section: Other Related Work (mentioning)
confidence: 99%
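A hedged sketch of a multiplier-bootstrap UCB index in the spirit of the statement above: the exploration bonus is a high quantile of the centred sample mean perturbed by i.i.d. Gaussian multiplier weights. The weight distribution and quantile level are assumptions, and any correction term used in the cited algorithm is omitted here for brevity.

```python
import numpy as np

# Hedged sketch: multiplier-bootstrap UCB index for one arm. The bonus is a
# high quantile of the multiplier-bootstrap distribution of the centred
# sample mean, using i.i.d. standard-Gaussian weights (an assumed choice).

def multiplier_bootstrap_ucb(rewards, n_boot=500, quantile=0.95, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(rewards, dtype=float)
    centred = x - x.mean()
    weights = rng.standard_normal((n_boot, len(x)))  # multiplier weights
    boot_stats = (weights * centred).mean(axis=1)    # perturbed centred means
    return x.mean() + np.quantile(boot_stats, quantile)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    obs = rng.binomial(1, 0.6, size=30).astype(float)
    print("bootstrapped index:", multiplier_bootstrap_ucb(obs, rng=rng))
```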