2009
DOI: 10.1016/j.tcs.2009.01.016
Exploration–exploitation tradeoff using variance estimates in multi-armed bandits


Cited by 444 publications (401 citation statements)
References 9 publications
“…For each arm i with µ_i < µ* and with i pulled at least once during the episode, we will bound its total number of pulls. Parts of this follow previous UCB1 analyses (Auer, Cesa-Bianchi, & Fischer 2002; Audibert, Munos, & Szepesvári 2009).…”
Section: Multi-armed Bandit Policy PUCB (supporting)
confidence: 73%
“…PUCB also differs slightly from the original UCB1 in using a 3/2 constant in c(t, s), whereas UCB1 used 2; other authors have discussed bounds for UCB1 using a range of values for this constant (Audibert, Munos, & Szepesvári 2009).…”
Section: Multi-armed Bandit Policy PUCB (mentioning)
confidence: 99%
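To make the role of this constant concrete, here is a minimal sketch (not taken from either paper; the function name, parameter names, and defaults are illustrative assumptions) of a UCB-style index with a configurable exploration coefficient: setting it to 2 recovers the usual UCB1 bonus sqrt(2 ln t / s), while 3/2 corresponds to the PUCB-style choice discussed above.

import math

def ucb_index(mean_reward: float, pulls: int, total_pulls: int,
              exploration_const: float = 2.0) -> float:
    # UCB-style index: empirical mean plus an exploration bonus.
    # exploration_const = 2 gives the original UCB1 bonus; a PUCB-style
    # variant would use 3/2 here.  (Illustrative sketch only.)
    if pulls == 0:
        return float("inf")  # force every arm to be tried at least once
    bonus = math.sqrt(exploration_const * math.log(total_pulls) / pulls)
    return mean_reward + bonus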
“…Based on the rewards, the MAB algorithm gradually learns the quality or usefulness of the subsets. We used an adversarial bandit algorithm (EXP3G (Auer et al 2002)) because our earlier experiments showed that with this bandit algorithm the strong learner obtained via ADABOOST.MH.BA has a slight but consistent accuracy advantage, while training is also faster than with stochastic bandit algorithms such as UCB (Auer et al 1995) and UCBV (Audibert et al 2009). The schematic overview of ADABOOST.MH.BA can be seen in Fig.…”
Section: Discussion (mentioning)
confidence: 99%
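For orientation, a single round of EXP3 can be sketched as follows (a generic EXP3 step, not the ADABOOST.MH.BA-specific integration; the function name and the reward callback are assumptions): an arm is drawn from a mixture of the weight distribution and the uniform distribution, its reward is importance-weighted, and only that arm's weight is updated.

import math
import random

def exp3_step(weights, gamma, draw_reward):
    # One EXP3 round: sample an arm, observe its reward in [0, 1], and
    # update that arm's weight with an importance-weighted estimate.
    # (Minimal sketch; not the boosting-specific variant above.)
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    arm = random.choices(range(k), weights=probs)[0]
    reward = draw_reward(arm)          # reward assumed to lie in [0, 1]
    estimate = reward / probs[arm]     # unbiased estimate of the reward
    weights[arm] *= math.exp(gamma * estimate / k)
    return arm, reward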
“…When the variances of the rewards associated with some of the actions are small, it makes sense to estimate these variances and use them in place of the range R in the above algorithm. A principled way of doing this was proposed and analyzed by Audibert et al. (2009). The resulting algorithm often outperforms UCB1 and can be shown to be essentially unimprovable.…”
Section: Online Learning In Bandits (mentioning)
confidence: 99%
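This is the variance-based idea at the heart of the cited paper: replace the range-driven exploration term with one driven by the empirical variance, plus a smaller range-based correction. A minimal sketch of such an index (the formula follows the variance-based index analyzed by Audibert et al. 2009, but the function name, the parameter zeta, and the defaults are illustrative assumptions) could look like this:

import math

def ucbv_index(mean: float, variance: float, pulls: int, t: int,
               reward_range: float = 1.0, zeta: float = 1.2) -> float:
    # Variance-based index: mean + sqrt(2 * V * E / s) + 3 * b * E / s,
    # with E = zeta * ln(t).  The exploration bonus shrinks when the
    # empirical variance V is small, instead of always scaling with the
    # worst-case range b.
    if pulls == 0:
        return float("inf")  # untried arms get priority
    e = zeta * math.log(t)
    return (mean
            + math.sqrt(2.0 * variance * e / pulls)
            + 3.0 * reward_range * e / pulls)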