Proceedings of IEEE 36th Annual Foundations of Computer Science
DOI: 10.1109/sfcs.1995.492488

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Abstract: In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines…
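The protocol described in the abstract is easy to state in code. Below is a minimal simulation sketch, assuming the casino fixes every payoff in advance (the adversarial setting); the payoff table and the uniform play policy are illustrative stand-ins, not anything from the paper:

```python
import random

# Sketch of the bandit protocol from the abstract: the casino fixes a
# payoff for every (trial, arm) pair in advance; on each trial the
# gambler picks one of K arms and observes only that arm's payoff.
K, T = 3, 100
payoffs = [[random.random() for _ in range(K)] for _ in range(T)]  # illustrative stand-in

total = 0.0
for t in range(T):
    arm = random.randrange(K)      # placeholder policy: play uniformly at random
    total += payoffs[t][arm]       # feedback is limited to the chosen arm

# Regret is measured against the best single arm in hindsight.
best_single_arm = max(sum(payoffs[t][a] for t in range(T)) for a in range(K))
print(f"gambler: {total:.1f}  vs  best single arm in hindsight: {best_single_arm:.1f}")
```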

Cited by 484 publications (594 citation statements)
References 13 publications
“…Assuming each arm in a slot machine has a different distribution of rewards, the goal is to identify the arm with the best expected return as early as possible and then to keep playing that arm. The problem is a classical example of the trade-off between exploration and exploitation [13]: on the one hand, if the gambler plays exclusively the machine he supposes to be the best one ("exploitation"), he may fail to discover that one of the other arms in fact has a higher average return. On the other hand, if he spends too much time trying out all K machines and basing decisions on the gathered statistics ("exploration"), he may fail to play the best arm for a long enough period to earn a high total return.…”
Section: Multi-armed Bandit Problem
confidence: 99%
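The trade-off this passage describes can be illustrated with a standard ε-greedy sketch. This is not the paper's algorithm (which makes no statistical assumptions), and the arm means below are hypothetical:

```python
import random

# Illustrative epsilon-greedy bandit: spend an epsilon fraction of trials
# exploring at random, otherwise exploit the arm with the best estimate.
means = [0.3, 0.5, 0.7]          # hypothetical expected payoffs of K = 3 arms
counts = [0] * len(means)
estimates = [0.0] * len(means)
epsilon = 0.1                    # fraction of trials spent exploring

for t in range(10_000):
    if random.random() < epsilon:                      # explore: try a random arm
        arm = random.randrange(len(means))
    else:                                              # exploit: best-looking arm
        arm = max(range(len(means)), key=lambda i: estimates[i])
    reward = 1.0 if random.random() < means[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print(counts)   # most pulls should concentrate on the best arm (index 2)
```

Too large an epsilon wastes trials on bad arms; too small an epsilon risks locking onto a suboptimal arm early, exactly the failure modes the quoted passage describes.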
“…10), and then computes a best response using value iteration (discount factor γ = 0.95). It uses ε-greedy exploration, ε = …. An expert algorithm (Auer et al., 1995) with well-defined regret bounds for the multi-armed bandit problem. It was shown to be effective in some repeated games (Bouzy & Metivier, 2010; Crandall & Goodrich, 2011; Chang & Kaelbling, 2005).…”
Section: Appendix B: Performance of Individual Experts
confidence: 99%
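The expert algorithm referenced here is Exp3 from Auer et al. A minimal sketch in the form usually cited for it, with exponential weights, a uniform-exploration mixing parameter γ, and importance-weighted reward estimates; the reward function below is a hypothetical stand-in:

```python
import math
import random

def exp3(K, T, reward_fn, gamma=0.1):
    """Minimal Exp3 sketch (after Auer et al.): exponential weights mixed
    with uniform exploration; importance-weighted reward estimates."""
    weights = [1.0] * K
    for t in range(T):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        x = reward_fn(t, arm)                    # observed reward, assumed in [0, 1]
        x_hat = x / probs[arm]                   # importance-weighted estimate
        weights[arm] *= math.exp(gamma * x_hat / K)
        m = max(weights)                         # rescale to avoid overflow
        weights = [w / m for w in weights]       # (weights are scale-invariant)
    return probs

# Hypothetical adversarial rewards: arm 1 pays on even trials, arm 0 on odd ones.
probs = exp3(K=2, T=1000, reward_fn=lambda t, a: float((t + a) % 2 == 1))
print(probs)
```

Because only the chosen arm's reward is observed, dividing by the probability of choosing it keeps the reward estimates unbiased; the γ/K floor guarantees that probability never vanishes.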
“…Many algorithms for repeated games have been developed over the last several decades, including reinforcement learning algorithms (e.g., Watkins & Dayan, 1992; Littman, 1994, 2001; Bowling & Veloso, 2002; Greenwald & Hall, 2003; Crandall & Goodrich, 2011), opponent modeling algorithms (e.g., Fudenberg & Levine, 1998; Ganzfried & Sandholm, 2011), algorithms for computing desirable equilibria (e.g., Littman & Stone, 2005; Cote & Littman, 2008; Johanson, Bard, Lanctot, Gibson, & Bowling, 2012), and expert algorithms (e.g., Auer, Cesa-Bianchi, & Fischer, 2002; de Farias & Megiddo, 2004; Auer, Cesa-Bianchi, Freund, & Schapire, 1995). While sometimes successful, these algorithms typically have one or more of the following shortcomings which preclude their use.…”
Section: Introduction
confidence: 99%
“…We use the EXP3 algorithm for GSA nodes (a variant of the Grigoriadis-Khachiyan algorithm [5,2,1,3]), leading to a probability of choosing an action of the form η + exp(εs)/C, where η and ε are … (Algorithm 2: Adapting the UCT algorithm for GSA cases).…”
Section: Adapting UCT to the GSA Acyclic Case
confidence: 99%
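The quoted probability form is an exponential-weights rule with a uniform exploration floor. A minimal sketch follows, assuming s is an action's cumulative score, η the floor, and C the normalizer chosen so the probabilities sum to 1; that reading of C is our assumption, and the scores are hypothetical:

```python
import math

def gsa_node_probs(scores, eta=0.01, eps=0.1):
    """Sketch of the quoted rule p_a = eta + exp(eps * s_a) / C.
    Assumption: C is set so the probabilities sum to 1 given the floor eta."""
    expw = [math.exp(eps * s) for s in scores]
    K = len(scores)
    # Solve sum_a (eta + expw_a / C) = 1  =>  C = sum(expw) / (1 - K * eta)
    C = sum(expw) / (1 - K * eta)
    return [eta + w / C for w in expw]

probs = gsa_node_probs([2.0, 0.5, -1.0])
print(probs, sum(probs))   # sums to 1.0 up to float rounding
```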
“…We did not consider random nodes here, but they could easily be included as well. We do not explicitly write a proof of the consistency of these algorithms, but we expect the proof to follow from properties in [5,8,2,1]. We will see the choice of constants below.…”
Section: End While If The Root Is In 1p Or 2p Then
confidence: 99%