1994
DOI: 10.1007/bf02191765
Multi-Armed bandit problem revisited

Cited by 33 publications (15 citation statements)
References 7 publications
“…There are methods, like the Gittins allocation indices, that allow one to find the optimal machine to play at each time n by considering each reward process independently of the others (even though the globally optimal solution depends on all the processes). However, computing the Gittins indices for the average (undiscounted) reward criterion used here requires prior knowledge of the reward processes (see, e.g., Ishikida & Varaiya, 1994). To overcome this requirement, one can learn the Gittins indices, as proposed in Duff (1995) for the case of finite-state Markovian reward processes.…”
Section: Discussion (mentioning)
confidence: 99%
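The excerpt above describes the defining structure of index policies: each arm is scored using only its own reward history, and the arm with the largest index is played. A minimal sketch of that structure is below; the UCB-style index used here is an illustrative stand-in (the actual Gittins index is considerably harder to compute), and all names are hypothetical.

```python
import math

def index_policy(histories, t):
    """Pick the arm with the largest per-arm index.

    Each arm's index depends only on that arm's own reward history,
    mirroring the structure of Gittins-style index policies. The
    mean-plus-bonus index below is an illustrative stand-in, not the
    Gittins index itself.
    """
    best_arm, best_index = None, -math.inf
    for arm, rewards in histories.items():
        if not rewards:                      # unplayed arms get priority
            return arm
        mean = sum(rewards) / len(rewards)
        bonus = math.sqrt(2 * math.log(t) / len(rewards))
        if mean + bonus > best_index:
            best_arm, best_index = arm, mean + bonus
    return best_arm
```

The key point the excerpt makes survives in the sketch: the per-arm score is computed without looking at any other arm, even though the globally optimal schedule depends on all of them.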
“…As a remark, note that a deterministic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [13]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received from each arm, and their problem is thus one of optimization, rather than exploration and exploitation.…”
Section: Introduction (mentioning)
confidence: 94%
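The contrast the excerpt draws can be made concrete: when every payoff is known in advance, the deterministic bandit reduces to pure optimization over pull sequences, with no exploration needed. A brute-force sketch under that assumption follows; the function name, discount factor, and data layout are illustrative.

```python
from itertools import product

def best_discounted_schedule(payoffs, horizon, beta=0.9):
    """Brute-force the pull sequence maximizing total discounted payoff.

    `payoffs[a]` is the payoff sequence of arm `a`, assumed known ahead
    of time as in the deterministic bandit setting described above, so
    the problem is pure optimization: no exploration/exploitation
    trade-off remains.
    """
    arms = list(payoffs)
    best_value, best_seq = float("-inf"), None
    for seq in product(arms, repeat=horizon):
        pos = {a: 0 for a in arms}           # next payoff index per arm
        value = 0.0
        for t, arm in enumerate(seq):
            value += beta ** t * payoffs[arm][pos[arm]]
            pos[arm] += 1
        if value > best_value:
            best_value, best_seq = value, seq
    return best_seq, best_value
```

For example, with `payoffs = {0: [1, 0], 1: [0.5, 0.5]}` and a horizon of 2, the optimal schedule pulls arm 0 first and then arm 1, collecting 1 + 0.9 × 0.5 = 1.45.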
“…Similar ideas were also used by Mandelbaum [12] and by Varaiya et al [17,9]. We now consider N bandit processes, with initial state Z(0) = i.…”
Section: Second Proof: Interleaving Of Prevailing Charges (mentioning)
confidence: 92%
“…We can now construct a sequence of stopping times with (9), which will continue indefinitely, or will reach IP(σ_{n_0} = τ(i)) = 1, in which case we define σ_n = τ(i), n > n_0.…”
Section: Theorem 2: The Supremum Of… (mentioning)
confidence: 99%