2002
DOI: 10.1137/s0097539701398375

The Nonstochastic Multiarmed Bandit Problem

Abstract: In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of th…

Cited by 1,791 publications (2,312 citation statements). References 17 publications.
“…We also extend our results to the partial information model, also called the adversarial multiarmed bandit (MAB) problem in Auer et al. (2002a). In this model, the online algorithm only gets to observe the loss of the action actually selected, and does not see the losses of the actions not chosen.…”
Section: Introduction (mentioning)
confidence: 64%
“…To solve the multi-armed bandit problem, the exponential-weight algorithm for exploration and exploitation (Exp3) was proposed by Auer et al. [14] in 2002. Exp3 is based on a reinforcement learning scheme and it solves the following problem: "If there are many available actions with uncertain outcomes in a system, how should the system act to maximize the quality of the results over many trials?"…”
Section: Multi-armed Bandit Problem (mentioning)
confidence: 99%
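
The excerpt above describes Exp3 only informally, so a minimal sketch may help make the exponential-weight scheme concrete. This is an illustration under stated assumptions, not the paper's reference implementation: the reward callback pull(arm), the horizon T, and the fixed exploration rate gamma are placeholders introduced here (Auer et al. also analyze tuned and time-varying exploration rates).

import math
import random

def exp3(K, T, pull, gamma=0.1):
    """Minimal Exp3 sketch: K arms, T rounds, pull(arm) -> reward in [0, 1].

    pull and the fixed gamma are assumptions of this sketch.
    """
    weights = [1.0] * K
    total_reward = 0.0
    for _ in range(T):
        wsum = sum(weights)
        # Mix exponential weights with uniform exploration (the gamma/K term).
        probs = [(1.0 - gamma) * w / wsum + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        reward = pull(arm)  # bandit feedback: only this arm's reward is seen
        total_reward += reward
        # Importance-weighted estimate: unbiased despite partial feedback.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / K)
        # Rescale to avoid overflow on long horizons (leaves probs unchanged).
        wmax = max(weights)
        weights = [w / wmax for w in weights]
    return total_reward

if __name__ == "__main__":
    # Hypothetical Bernoulli arms for demonstration only.
    means = [0.2, 0.5, 0.7]
    total = exp3(K=3, T=10_000,
                 pull=lambda a: float(random.random() < means[a]))
    print(f"total reward over 10,000 rounds: {total:.0f}")

The importance-weighted estimate reward / probs[arm] is the key device: it is an unbiased estimate of the chosen arm's reward even though, as the earlier excerpt notes, only the selected action's payoff is observed in each round.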
“…in Cesa-Bianchi et al.'s CombBand [6], which is itself an adaptation of Auer et al.'s Exp3 [2] from the finite case to the structured combinatorial case. The distribution from which the actions π_t are drawn in the algorithm differs from the distribution used in CombBand, and gives rise to the technical difficulty of variance estimation, resolved in Lemma 2.…”
Section: Algorithm BanditRank and Its Guarantee (mentioning)
confidence: 99%