2020
DOI: 10.1017/9781108571401

Bandit Algorithms

Abstract: Bandit problems were introduced by William R. Thompson in an article published in 1933 in Biometrika. Thompson was interested in medical trials and the cruelty of running a trial blindly, without adapting the treatment allocations on the fly as the drug appears more or less effective. The name comes from the 1950s, when Frederick Mosteller and Robert Bush decided to study animal learning and ran trials on mice and then on humans. The mice faced the dilemma of choosing to go l…

[Figure 1.1: Mouse learning a T-maze.]
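
The adaptive-allocation idea the abstract credits to Thompson is what is now called Thompson sampling. Below is a minimal illustrative sketch of it for a two-armed Bernoulli bandit; the code and the arm means are made up for illustration and are not taken from the book.

```python
# Illustrative Thompson sampling for a two-armed Bernoulli bandit.
# Each arm keeps a Beta posterior over its success probability; we
# sample from each posterior and play the arm with the best sample,
# so the allocation adapts as one "treatment" looks more effective.
import random

def thompson_sampling(true_means, horizon, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta(1, 1) uniform prior on each arm
    failures = [1] * n_arms
    total_reward = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm; play the best sample.
        samples = [rng.betavariate(successes[a], failures[a])
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Hypothetical success rates for the two "treatments".
print(thompson_sampling([0.45, 0.55], horizon=1000))
```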

Cited by 1,184 publications (1,387 citation statements); references 167 publications.
Citation statements: 17 supporting, 1,370 mentioning, 0 contrasting; ordered below by relevance.

“…where the division of δ by n is in accordance with a union bound over the n arms.¹ A zero-mean random variable Z is sub-Gaussian with parameter σ if E[e^{λZ}] ≤ exp(λ²σ²/2)…”
Section: A Auxiliary Results · mentioning · confidence: 99%
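
For context, the δ/n split follows the standard union-bound argument; the sketch below is not quoted from the citing paper and assumes each arm's empirical mean μ̂ᵢ is built from m i.i.d. σ-sub-Gaussian samples with true mean μᵢ.

```latex
% Per-arm tail bound: each confidence interval fails with probability at most delta/n.
\[
  \Pr\!\Bigl(|\hat\mu_i - \mu_i| \ge \sigma\sqrt{\tfrac{2\log(2n/\delta)}{m}}\Bigr)
  \;\le\; \frac{\delta}{n}, \qquad i = 1, \dots, n.
\]
% Union bound over the n arms: all intervals hold simultaneously
% with probability at least 1 - delta.
\[
  \Pr\!\Bigl(\exists\, i \le n :\, |\hat\mu_i - \mu_i| \ge \sigma\sqrt{\tfrac{2\log(2n/\delta)}{m}}\Bigr)
  \;\le\; \sum_{i=1}^{n} \frac{\delta}{n} \;=\; \delta.
\]
```
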
“…The literature on theory and algorithms for MAB problems is extensive; see [1], [7] for recent overviews. One of the main defining features of such problems is the distinction between stochastic vs. adversarial rewards; this paper focuses exclusively on the former.…”
mentioning · confidence: 99%

“…Let ρ(t) be the maximum instantaneous estimation error of the multi-objective reward in trial t, where ρ(t) = max_{i∈[I], a∈A} |r̂_a^i(t) − r_a^i(t)|. Based on existing studies [Lattimore and Szepesvári, 2018], we can show that ρ(t) is bounded with high probability. When the instantaneous error in estimating multi-objective rewards is bounded, we can upper bound the regret of a multi-armed bandit instance running in a specific rank as O(I…”
Section: Discussion · mentioning · confidence: 99%
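
As a generic sketch of why a bound on ρ(t) yields a regret bound (a paraphrase of the standard argument, not the cited paper's derivation): if the played arm a_t maximizes the current estimate r̂ and every estimate is within ρ(t) of the true reward, the per-round regret against the best arm a* is at most 2ρ(t).

```latex
\[
  r_{a^\star}(t) - r_{a_t}(t)
  \;\le\; \bigl(\hat r_{a^\star}(t) + \rho(t)\bigr) - \bigl(\hat r_{a_t}(t) - \rho(t)\bigr)
  \;\le\; 2\rho(t),
\]
% using \hat r_{a_t}(t) \ge \hat r_{a^\star}(t), since a_t maximizes the estimate;
% summing over t then bounds the cumulative regret by 2 * sum_t rho(t).
```
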
“…An excellent book on multi-armed bandits, Lattimore and Szepesvári (2019), will appear later this year. This book is much larger than ours; it provides a deeper treatment for a number of topics, and omits a few others.…”
Section: Preface · mentioning · confidence: 99%