2012
DOI: 10.1007/978-3-642-34106-9_18

Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis

Abstract: The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, a comparison that has been lacking in the literature for the Bernoulli case until now.
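
For reference, the Lai and Robbins lower bound mentioned in the abstract can be stated as follows for Bernoulli rewards (the standard form; the notation here is ours, not quoted from the paper):

% Lai & Robbins (1985) asymptotic lower bound, Bernoulli case.
% \mu_a: mean of arm a; \mu^*: mean of the best arm; R_T: cumulative regret.
\[
\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\log T}
\;\ge\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathrm{kl}(\mu_a, \mu^*)},
\qquad
\mathrm{kl}(p, q) \;=\; p \log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}.
\]

The paper's contribution is a finite-time regret bound for Thompson Sampling whose leading term matches this asymptotic rate.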

Cited by 342 publications (375 citation statements)
References 8 publications (7 reference statements)
“…Both UCB, kl-UCB and TS have been proved to have logarithmic regrets [14], [15], [16], meaning that R…”
Section: Classical MAB Algorithms: UCB, kl-UCB, TS
Mentioning confidence: 99%
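
The truncated "R…" in this excerpt presumably refers to the cumulative regret; in standard notation (ours, not the citing paper's), logarithmic regret means

\[
R_T \;=\; T\mu^* \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} X_t\right] \;=\; O(\log T),
\]

where $\mu^*$ is the mean of the best arm and $X_t$ is the reward received at round $t$.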
“…The theoretical analysis arguing that TS is a viable solution method to MAB problems is an active area of research (Kaufmann et al. 2012; Ortega and Braun 2010; Russo and Van Roy 2014). While it may seem like a simple heuristic, TS has been shown to be an optimal policy with respect to minimizing finite-time regret (Agrawal and Goyal 2012; Kaufmann et al. 2012), minimizing relative entropy (Ortega and Braun 2013), and min- […] The TS allocation rule encodes model uncertainty by drawing samples from the posterior.…”
Section: Optimization Problem
Mentioning confidence: 99%
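
The posterior-sampling allocation rule described in this excerpt is short enough to sketch. Below is a minimal illustration for Bernoulli bandits with Beta(1,1) priors (our sketch, not code from any of the cited papers; the two arm means are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

def ts_bernoulli_step(success, failure):
    # TS allocation rule: draw one sample from each arm's Beta posterior
    # (Beta(1 + successes, 1 + failures) under a uniform prior) and play
    # the arm whose sample is largest.
    samples = rng.beta(success + 1, failure + 1)
    return int(np.argmax(samples))

# Tiny simulation on a two-armed bandit with hypothetical means 0.5 and 0.6.
means = np.array([0.5, 0.6])
success, failure = np.zeros(2), np.zeros(2)
for t in range(10_000):
    a = ts_bernoulli_step(success, failure)
    r = float(rng.random() < means[a])
    success[a] += r
    failure[a] += 1.0 - r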
“…In the case of the standard SMAB problem, bandit algorithms following the Bayesian approach, namely Bayesian-UCB [19] and Thompson sampling [20], have stronger, theoretically grounded guarantees and have achieved better performance in experiments on other tasks [19,20] than pointwise alternatives such as UCB-1. However, to the best of our knowledge, they have not been applied to the OLREE problem before.…”
Section: Bayesian Bandits
Mentioning confidence: 99%
“…In the case of $\alpha_{\mathrm{low}}(t) = \alpha_{\mathrm{up}}(t)$, this algorithm coincides with Bayesian-UCB [19], and if $\alpha_{\mathrm{low}}(t) = 0$, $\alpha_{\mathrm{up}}(t) = 1$, it reduces to Thompson sampling [20]. Note that we can store $\{\gamma_{a,t}, W_{a,t}, \{p_{a,0}(r)\}_{r \in [0,1]}\}$ instead of $\{p_{a,t}(r)\}_{r \in [0,1]}$ and update only $\gamma_{a,t}$ and $W_{a,t}$ when we run Bayesian bandits in practice.…”
Section: Bayesian Bandits
Mentioning confidence: 99%
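
Reading the quoted statement literally, the interpolation between Bayesian-UCB and Thompson sampling can be sketched as a posterior-quantile index with a randomized quantile level. This is our reconstruction under that assumption; the function and variable names are hypothetical, and the exact algorithm is the one given in the citing paper:

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

def quantile_index_step(success, failure, alpha_low, alpha_up):
    # For each arm, evaluate the Beta posterior's inverse CDF at a level
    # drawn uniformly from [alpha_low, alpha_up], then play the argmax.
    # alpha_low == alpha_up gives a fixed-quantile (Bayes-UCB-style) index;
    # alpha_low = 0, alpha_up = 1 recovers Thompson sampling, since a
    # posterior sample equals the inverse CDF at a Uniform(0, 1) draw.
    u = rng.uniform(alpha_low, alpha_up, size=len(success))
    indices = beta.ppf(u, success + 1, failure + 1)
    return int(np.argmax(indices))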