2009
DOI: 10.1016/j.tcs.2009.01.016
Exploration–exploitation tradeoff using variance estimates in multi-armed bandits


Cited by 444 publications (401 citation statements)
References 9 publications
“…For each arm i with µ_i < µ* and with i pulled at least once during the episode, we will bound its total number of pulls. Parts of this follow previous UCB1 analyses (Auer, Cesa-Bianchi, & Fischer 2002; Audibert, Munos, & Szepesvári 2009).…”
Section: Multi-armed Bandit Policy PUCB (supporting)
confidence: 73%
“…PUCB also differs slightly from the original UCB1 in using a 3/2 constant in c(t, s), whereas UCB1 used 2; other authors have discussed bounds for UCB1 using a range of values for this constant (Audibert, Munos, & Szepesvári 2009).…”
Section: Multi-armed Bandit Policy PUCB (mentioning)
confidence: 99%
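To make the role of this constant concrete, here is a minimal sketch (not taken from either paper; the function name, parameter names, and defaults are illustrative assumptions) of a UCB-style index with a configurable exploration coefficient: setting it to 2 recovers the usual UCB1 bonus sqrt(2 ln t / s), while 3/2 corresponds to the PUCB-style choice discussed above.

import math

def ucb_index(mean_reward: float, pulls: int, total_pulls: int,
              exploration_const: float = 2.0) -> float:
    # UCB-style index: empirical mean plus an exploration bonus.
    # exploration_const = 2 gives the original UCB1 bonus; a PUCB-style
    # variant would use 3/2 here.  (Illustrative sketch only.)
    if pulls == 0:
        return float("inf")  # force every arm to be tried at least once
    bonus = math.sqrt(exploration_const * math.log(total_pulls) / pulls)
    return mean_reward + bonus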
“…Based on the rewards, the MAB algorithm gradually learns the quality or usefulness of the subsets. We used an adversarial bandit algorithm (EXP3G (Auer et al 2002)) because our earlier experiments showed that with this bandit algorithm the strong learner obtained via ADABOOST.MH.BA has a slight but consistent accuracy advantage, while training is also faster than with stochastic bandit algorithms such as UCB (Auer et al 1995) and UCBV (Audibert et al 2009). The schematic overview of ADABOOST.MH.BA can be seen in Fig.…”
Section: Discussion (mentioning)
confidence: 99%
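For orientation, a single round of EXP3 can be sketched as follows (a generic EXP3 step, not the ADABOOST.MH.BA-specific integration; the function name and the reward callback are assumptions): an arm is drawn from a mixture of the weight distribution and the uniform distribution, its reward is importance-weighted, and only that arm's weight is updated.

import math
import random

def exp3_step(weights, gamma, draw_reward):
    # One EXP3 round: sample an arm, observe its reward in [0, 1], and
    # update that arm's weight with an importance-weighted estimate.
    # (Minimal sketch; not the boosting-specific variant above.)
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    arm = random.choices(range(k), weights=probs)[0]
    reward = draw_reward(arm)          # reward assumed to lie in [0, 1]
    estimate = reward / probs[arm]     # unbiased estimate of the reward
    weights[arm] *= math.exp(gamma * estimate / k)
    return arm, reward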
“…When the variances of the rewards associated with some of the actions are small, it makes sense to estimate these variances and use them in place of the range R in the above algorithm. A principled way of doing this was proposed and analyzed by Audibert et al. (2009). The resulting algorithm often outperforms UCB1 and can be shown to be essentially unimprovable.…”
Section: Online Learning In Bandits (mentioning)
confidence: 99%
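This is the variance-based idea at the heart of the cited paper: replace the range-driven exploration term with one driven by the empirical variance, plus a smaller range-based correction. A minimal sketch of such an index (the formula follows the variance-based index analyzed by Audibert et al. 2009, but the function name, the parameter zeta, and the defaults are illustrative assumptions) could look like this:

import math

def ucbv_index(mean: float, variance: float, pulls: int, t: int,
               reward_range: float = 1.0, zeta: float = 1.2) -> float:
    # Variance-based index: mean + sqrt(2 * V * E / s) + 3 * b * E / s,
    # with E = zeta * ln(t).  The exploration bonus shrinks when the
    # empirical variance V is small, instead of always scaling with the
    # worst-case range b.
    if pulls == 0:
        return float("inf")  # untried arms get priority
    e = zeta * math.log(t)
    return (mean
            + math.sqrt(2.0 * variance * e / pulls)
            + 3.0 * reward_range * e / pulls)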