Proceedings of IEEE 36th Annual Foundations of Computer Science
DOI: 10.1109/sfcs.1995.492488

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Abstract: In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines…
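The protocol described in the abstract is easy to state in code. Below is a minimal simulation sketch, assuming the casino fixes every payoff in advance (the adversarial setting); the payoff table and the uniform play policy are illustrative stand-ins, not anything from the paper:

```python
import random

# Sketch of the bandit protocol from the abstract: the casino fixes a
# payoff for every (trial, arm) pair in advance; on each trial the
# gambler picks one of K arms and observes only that arm's payoff.
K, T = 3, 100
payoffs = [[random.random() for _ in range(K)] for _ in range(T)]  # illustrative stand-in

total = 0.0
for t in range(T):
    arm = random.randrange(K)      # placeholder policy: play uniformly at random
    total += payoffs[t][arm]       # feedback is limited to the chosen arm

# Regret is measured against the best single arm in hindsight.
best_single_arm = max(sum(payoffs[t][a] for t in range(T)) for a in range(K))
print(f"gambler: {total:.1f}  vs  best single arm in hindsight: {best_single_arm:.1f}")
```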

Cited by 484 publications (594 citation statements)
References 13 publications
“…Assuming each arm in a slot machine has a different distribution of rewards, the goal is to identify the arm with the best expected return as early as possible and then to keep playing that arm. The problem is a classical example of the trade-off between exploration and exploitation [13]: on the one hand, if the gambler plays exclusively the machine he supposes to be the best one ("exploitation"), he may fail to discover that one of the other arms in fact has a higher average return. On the other hand, if he spends too much time trying out all K machines and basing decisions on the gathered statistics ("exploration"), he may fail to play the best arm for a long enough period to earn a high total return.…”
Section: Multi-armed Bandit Problem
confidence: 99%
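The trade-off this passage describes can be illustrated with a standard ε-greedy sketch. This is not the paper's algorithm (which makes no statistical assumptions), and the arm means below are hypothetical:

```python
import random

# Illustrative epsilon-greedy bandit: spend an epsilon fraction of trials
# exploring at random, otherwise exploit the arm with the best estimate.
means = [0.3, 0.5, 0.7]          # hypothetical expected payoffs of K = 3 arms
counts = [0] * len(means)
estimates = [0.0] * len(means)
epsilon = 0.1                    # fraction of trials spent exploring

for t in range(10_000):
    if random.random() < epsilon:                      # explore: try a random arm
        arm = random.randrange(len(means))
    else:                                              # exploit: best-looking arm
        arm = max(range(len(means)), key=lambda i: estimates[i])
    reward = 1.0 if random.random() < means[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print(counts)   # most pulls should concentrate on the best arm (index 2)
```

Too large an epsilon wastes trials on bad arms; too small an epsilon risks locking onto a suboptimal arm early, exactly the failure modes the quoted passage describes.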
“…10), and then computes a best response using value iteration (discount factor γ = 0.95). It uses ε-greedy exploration, ε = …. An expert algorithm (Auer et al., 1995) with well-defined regret bounds for the multi-armed bandit problem. It was shown to be effective in some repeated games (Bouzy & Metivier, 2010; Crandall & Goodrich, 2011; Chang & Kaelbling, 2005).…”
Section: Appendix B: Performance of Individual Experts
confidence: 99%
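The expert algorithm referenced here is Exp3 from Auer et al. A minimal sketch in the form usually cited for it, with exponential weights, a uniform-exploration mixing parameter γ, and importance-weighted reward estimates; the reward function below is a hypothetical stand-in:

```python
import math
import random

def exp3(K, T, reward_fn, gamma=0.1):
    """Minimal Exp3 sketch (after Auer et al.): exponential weights mixed
    with uniform exploration; importance-weighted reward estimates."""
    weights = [1.0] * K
    for t in range(T):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        x = reward_fn(t, arm)                    # observed reward, assumed in [0, 1]
        x_hat = x / probs[arm]                   # importance-weighted estimate
        weights[arm] *= math.exp(gamma * x_hat / K)
        m = max(weights)                         # rescale to avoid overflow
        weights = [w / m for w in weights]       # (weights are scale-invariant)
    return probs

# Hypothetical adversarial rewards: arm 1 pays on even trials, arm 0 on odd ones.
probs = exp3(K=2, T=1000, reward_fn=lambda t, a: float((t + a) % 2 == 1))
print(probs)
```

Because only the chosen arm's reward is observed, dividing by the probability of choosing it keeps the reward estimates unbiased; the γ/K floor guarantees that probability never vanishes.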
“…Many algorithms for repeated games have been developed over the last several decades, including reinforcement learning algorithms (e.g., Watkins & Dayan, 1992; Littman, 1994, 2001; Bowling & Veloso, 2002; Greenwald & Hall, 2003; Crandall & Goodrich, 2011), opponent modeling algorithms (e.g., Fudenberg & Levine, 1998; Ganzfried & Sandholm, 2011), algorithms for computing desirable equilibria (e.g., Littman & Stone, 2005; Cote & Littman, 2008; Johanson, Bard, Lanctot, Gibson, & Bowling, 2012), and expert algorithms (e.g., Auer, Cesa-Bianchi, & Fischer, 2002; de Farias & Megiddo, 2004; Auer, Cesa-Bianchi, Freund, & Schapire, 1995). While sometimes successful, these algorithms typically have one or more of the following shortcomings which preclude their use.…”
Section: Introduction
confidence: 99%
“…We use the EXP3 algorithm for GSA nodes (a variant of the Grigoriadis-Khachiyan algorithm [5,2,1,3]), leading to a probability of choosing an action of the form η + exp(εs)/C, where η and ε are … (Algorithm 2: Adapting the UCT algorithm for GSA cases).…”
Section: Adapting UCT to the GSA Acyclic Case
confidence: 99%
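The quoted probability form is an exponential-weights rule with a uniform exploration floor. A minimal sketch follows, assuming s is an action's cumulative score, η the floor, and C the normalizer chosen so the probabilities sum to 1; that reading of C is our assumption, and the scores are hypothetical:

```python
import math

def gsa_node_probs(scores, eta=0.01, eps=0.1):
    """Sketch of the quoted rule p_a = eta + exp(eps * s_a) / C.
    Assumption: C is set so the probabilities sum to 1 given the floor eta."""
    expw = [math.exp(eps * s) for s in scores]
    K = len(scores)
    # Solve sum_a (eta + expw_a / C) = 1  =>  C = sum(expw) / (1 - K * eta)
    C = sum(expw) / (1 - K * eta)
    return [eta + w / C for w in expw]

probs = gsa_node_probs([2.0, 0.5, -1.0])
print(probs, sum(probs))   # sums to 1.0 up to float rounding
```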
“…We did not consider random nodes here, but they could easily be included as well. We do not explicitly write a proof of the consistency of these algorithms, but we expect the proof to follow from properties in [5,8,2,1]. We will see the choice of constants below.…”
Section: End While If The Root Is In 1p Or 2p Then
confidence: 99%