Optimal discovery with probabilistic expert advice

Bubeck, Sébastien; Ernst, Damien; Garivier, Aurélien

doi:10.1109/cdc.2012.6426724

Cited by 1 publication

(4 citation statements)

References 11 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Many new domains of application for bandits problems are currently investigated. For example: multichannel opportunistic communications Liu et al [2010], model selection Agarwal et al [2011a], boosting Busa-Fekete and Kegl [2011], management of dark pools of liquidity (a recent type of stock exchange) Agarwal et al [2010a], security analysis of power systems Bubeck et al [2011a].…”

Section: Discussionmentioning

confidence: 99%

“…A simple strategy that attains this rate, based on the Successive Elimination algorithm of Even-Dar et al [2002], was proposed by Yue and Joachims [2011]. Bubeck et al [2011a] study a model with a stochastic bandit flavor (in fact it can be cast as an MDP), where the key for the analysis is a sort 7.5. Many-armed bandits 109 of 'non-linear' regret bound.…”

Section: Dueling Banditsmentioning

confidence: 99%

“…Taking into account these issues, it turns out that it is easier to show that for good strategies, F (n) is not too far from F * (n ′ ), where n ′ is not much smaller than n. Such a statement -which can be interepreted as a non-linear regret bound -shows that the analyzed strategy slightly 'lags' behind the optimal strategy. In Bubeck et al [2011a] a non-linear regret bound is derived for an algorithm based on estimating the mass of interesting items left on each arm (the so-called Good-Turing estimator), combined with the optimism in face of uncertainty principle of Chapter 2. We refer the reader to Bubeck et al [2011a] for more precise statements.…”

Section: Dueling Banditsmentioning

confidence: 99%

“…In Bubeck et al [2011a] a non-linear regret bound is derived for an algorithm based on estimating the mass of interesting items left on each arm (the so-called Good-Turing estimator), combined with the optimism in face of uncertainty principle of Chapter 2. We refer the reader to Bubeck et al [2011a] for more precise statements.…”

Section: Dueling Banditsmentioning

confidence: 99%

See 3 more Smart Citations

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Bubeck

Cesa-Bianchi

2012

FNT in Machine Learning

1,044

501

View full text Add to dashboard Cite

Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the Thirties, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this survey, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model. 7 Variants 104 7.1 Markov Decision Processes, restless and sleeping bandits 105 7.2 Pure exploration problems 106 7.3 Dueling bandits 108 7.4 Discovery with probabilistic expert advice 108 7.5 Many-armed bandits 109 7.6 Truthful bandits 110 7.7 Concluding remarks 110 Acknowledgements 112by Foster and Vohra [1998] and Hart and Mas-Colell [2000, 2001]. Approximately at the same time, the problem was re-discovered in computer science by Auer et al. [2002b]. It was them who made apparent the connection to stochastic bandits by coining the term nonstochastic multi-armed bandit problem. The third fundamental model of multi-armed bandits assumes that the reward processes are neither i.i.d. (like in stochastic bandits) nor adversarial. More precisely, arms are associated with K Markov processes, each with its own state space. Each time an arm i is chosen in state s, a stochastic reward is drawn from a probability distribution ν i,s , and the state of the reward process for arm i changes in a Markovian fashion, based on an underlying stochastic transition matrix M i . Both reward and new state are revealed to the player. On the other hand, the state of arms that are not chosen remains unchanged. Going back to our initial interpretation of bandits as sequential resource allocation processes, here we may think of K competing projects that are sequentially allocated a unit resource of work. However, unlike the previous bandit models, in this case the state of a project that gets the resource may change. Moreover, the underlying stochastic transition matrices M i are typically assumed to be known, thus the optimal policy can be computed via dynamic programming and the problem is essentially of computational nature. The seminal result of Gittins [1979] provides an optimal greedy policy which can be computed efficiently.A notable special case of Markovian bandits is that of Bayesian bandits. These are parametric stochastic bandits, where the parameters of the reward distributions are assumed to be drawn from known priors, and the regret is computed by also averaging over the draw of parameters from the prior. The Markovian state change...

show abstract

Section: Discussionmentioning

confidence: 99%

Section: Dueling Banditsmentioning

confidence: 99%