Optimal Adaptive Policies for Sequential Allocation Problems

Burnetas, Apostolos; Katehakis, Michael N.

doi:10.1006/aama.1996.0007

Cited by 160 publications

(167 citation statements)

References 24 publications

Supporting

Mentioning

158

Contrasting

Unclassified

“…We show however that this is possible. The constant before the logarithmic term consists of the ratio Δ_a/K_a that is also very similar to the known bounds for the expected regret ([6, 22]), up to the constant c, that could definitely be reduced by a more careful analysis and parameter tuning (this is not the main focus of this work), and more importantly the constant 1 + ε_a. Theorem 1 holds for a larger class of distributions than the one considered e.g.…”

Section: Theorem (supporting)

confidence: 53%

Robust Risk-Averse Stochastic Multi-armed Bandits

Maillard

2013

Lecture Notes in Computer Science


Abstract. We study a variant of the standard stochastic multi-armed bandit problem when one is not interested in the arm with the best mean, but instead in the arm maximising some coherent risk measure criterion. Further, we are studying the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB, to solve this problem, together with a high probability bound on its regret.

“…Robbins' results were also obtained by Yakowitz and Lowe (1991), and by Burnetas and Katehakis (1996).…”

Section: Acknowledgments (mentioning)

confidence: 65%

Finite-time Analysis of the Multiarmed Bandit Problem

2002


Abstract. Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
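The "simple and efficient policies" mentioned in this abstract are upper-confidence-bound index policies. The following is a minimal sketch of one such policy, assuming Bernoulli rewards and the commonly used exploration bonus sqrt(2 ln t / n_i); the function names, the arm means, and the horizon are hypothetical choices made for illustration and are not taken from this page.

```python
import math
import random


def ucb1(arms, horizon, seed=0):
    """Run a UCB1-style index policy on Bernoulli arms; return pull counts per arm."""
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k   # number of times each arm was played
    sums = [0.0] * k   # total reward collected from each arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialise the estimates
        else:
            # index = empirical mean + exploration bonus sqrt(2 ln t / n_i)
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arms[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


def expected_regret(arms, counts):
    """Pseudo-regret: sum over arms of (gap to the best mean) * (number of pulls)."""
    best = max(arms)
    return sum((best - p) * n for p, n in zip(arms, counts))


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli success probabilities
    pulls = ucb1(means, horizon=5000)
    print(pulls, expected_regret(means, pulls))
```

Running the sketch for increasing horizons and plotting the pseudo-regret against ln(horizon) is one way to see the roughly logarithmic growth the abstract describes.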


“…13) times (which, from (1. Burnetas and Katehakis [37] extended this result to several classes P of multi-dimensional parametric distributions. By writing…”

Section: Lower Bounds (mentioning)

confidence: 83%

“…There are two types of lower bounds: (1) The problem-dependent bounds [81,37], which say that for a given problem, any "admissible" algorithm will suffer, asymptotically, a logarithmic regret with a constant factor that depends on the arm distributions. (2) The problem-independent bounds [41,30], which state that for any algorithm and any time horizon n, there exists an environment on which this algorithm will have a regret at least of order √(Kn).…”

Section: Lower Bounds (mentioning)

confidence: 99%
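For orientation, the two kinds of lower bound described in the statement above are usually written as follows. This is a sketch in assumed notation that does not appear on this page: R_n is the expected regret after n plays, K the number of arms, \Delta_a the gap between the best mean \mu^* and the mean of arm a, \nu_a the reward distribution of arm a, \mathcal{P} the class of allowed distributions, and c > 0 an unspecified universal constant.

```latex
% Problem-dependent (asymptotic) bound, for any admissible policy:
\liminf_{n \to \infty} \frac{R_n}{\ln n}
  \;\ge\; \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{K_{\inf}(\nu_a, \mu^*)},
\qquad
K_{\inf}(\nu_a, \mu^*) = \inf\bigl\{ \mathrm{KL}(\nu_a, \nu') : \nu' \in \mathcal{P},\ \mathbb{E}_{\nu'}[X] > \mu^* \bigr\}.

% Problem-independent (minimax) bound, for any policy and any horizon n:
\sup_{\text{environments}} R_n \;\ge\; c \sqrt{K n}.
```

The first bound fixes the environment and lets n grow, which is why its constant depends on the arm distributions; the second fixes n and lets the environment be chosen adversarially against the algorithm.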


From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning

Munos

2014

FNT in Machine Learning


This work covers several aspects of the optimism in the face of uncertainty principle applied to large scale optimization problems under finite numerical budget. The initial motivation for the research reported here originated from the empirical success of the so-called Monte-Carlo Tree Search method popularized in Computer Go and further extended to many other games as well as optimization and planning problems. Our objective is to contribute to the development of theoretical foundations of the field by characterizing the complexity of the underlying optimization problems and designing efficient algorithms with performance guarantees. The main idea presented here is that it is possible to decompose a complex decision making problem (such as an optimization problem in a large search space) into a sequence of elementary decisions, where each decision of the sequence is solved using a (stochastic) multi-armed bandit (simple mathematical model for decision making in stochastic environments). This so-called hierarchical bandit approach (where the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level) possesses the nice feature of starting the exploration by a quasi-uniform sampling of the space and then focusing progressively on the most promising area, at different scales, according to the evaluations observed so far, and eventually performing a local search around the global optima of the function. The performance of the method is assessed in terms of the optimality of the returned solution as a function of the number of function evaluations. Our main contribution to the field of function optimization is a class of hierarchical optimistic algorithms designed for general search spaces (such as metric spaces, trees, graphs, Euclidean spaces, ...) with different algorithmic instantiations depending on whether the evaluations are noisy or noiseless and whether some measure of the "smoothness" of the function is known or unknown. The performance of the algorithms depends on the local behavior of the function around its global optima expressed in terms of the quantity of near-optimal states measured with some metric. If this local smoothness of the function is known then one can design very efficient optimization algorithms (with convergence rate independent of the space dimension), and when it is not known, we can build adaptive techniques that can, in some cases, perform almost as well as when it is known. In order to be self-contained, we start with a brief introduction to the stochastic multi-armed bandit problem in Chapter 1 and describe the UCB (Upper Confidence Bound) strategy and several extensions. In Chapter 2 we present the Monte-Carlo Tree Search method applied to Computer Go and show the limitations of previous algorithms such as UCT (UCB applied to Trees). This provides motivation for designing theoretically well-founded optimistic optimization algorithms. The main contributions on hierarchical optimistic optimization are described in Chapters 3 and 4...
