2012
DOI: 10.48550/arxiv.1204.5721
Preprint

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Cited by 98 publications (144 citation statements). References 0 publications.

“…Brute force Dynamic programming (Bf) Here, we do not describe a policy U = {U^a_t}_{a∈A, t∈{0,…,T−1}}, but an algorithm Bf to compute V_0(π_0) in (3). Solving the maximization problem (3), that is, computing V_0(π_0) for a given prior (like, for instance, the uniform law given by the beta distribution β(1, 1) for all arms) can be done using Dynamic programming on the equivalent formulation (8). This is however only possible for relatively small instances of problem (3), that is, for a limited number |A| of arms and a limited time horizon T.…”
Section: Algorithms Tested (mentioning)
confidence: 99%
“…Solving the maximization problem (3), that is, computing V_0(π_0) for a given prior (like, for instance, the uniform law given by the beta distribution β(1, 1) for all arms) can be done using Dynamic programming on the equivalent formulation (8). This is however only possible for relatively small instances of problem (3), that is, for a limited number |A| of arms and a limited time horizon T. We recall here that solving the problem for |A| arms requires solving a Bellman equation with a state of dimension 2|A| (a state described by two integers per arm), which implies an exponential increase in computational cost with respect to |A|.…”
Section: Algorithms Tested (mentioning)
confidence: 99%
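To make the scale of this brute-force computation concrete, below is a minimal Python sketch of a Bayes-adaptive dynamic program for a Bernoulli bandit with independent uniform β(1, 1) priors. The function name and recursion layout are our own illustration, not code from the cited paper; the state is the tuple of (successes, failures) counts per arm, i.e. 2|A| integers, which is why the cost grows so quickly with the number of arms and the horizon.

```python
from functools import lru_cache

def bayes_optimal_value(num_arms, horizon):
    """Hypothetical sketch: Bayes-optimal value V_0(pi_0) of a Bernoulli
    bandit under independent Beta(1, 1) priors, computed by brute-force
    backward recursion over the (successes, failures) counts of every arm."""

    @lru_cache(maxsize=None)
    def value(t, state):
        # state: tuple of (successes, failures) pairs, one per arm
        if t == horizon:
            return 0.0
        best = 0.0
        for a, (s, f) in enumerate(state):
            # Posterior mean of arm a under its Beta(1 + s, 1 + f) posterior
            p = (1 + s) / (2 + s + f)
            win = state[:a] + ((s + 1, f),) + state[a + 1:]
            lose = state[:a] + ((s, f + 1),) + state[a + 1:]
            # Expected immediate reward plus optimal continuation value
            q = p * (1.0 + value(t + 1, win)) + (1 - p) * value(t + 1, lose)
            best = max(best, q)
        return best

    return value(0, tuple((0, 0) for _ in range(num_arms)))

# Only feasible for small instances, e.g. two arms and a short horizon:
print(bayes_optimal_value(num_arms=2, horizon=10))
```

The memoized recursion makes the exponential blow-up visible: the number of reachable states grows roughly like T^(2|A|), so adding even one arm multiplies the work substantially.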
“…This has origins in clinical trial studies dating back to 1933 (Thompson 1933), which gave rise to the earliest known MAB heuristic, Thompson Sampling (see Agrawal & Goyal (2012)). Today, the MAB problem manifests itself in various forms, with applications ranging from dynamic pricing and online auctions to packet routing, scheduling, e-commerce and matching markets, among others (see Bubeck & Cesa-Bianchi (2012) for a comprehensive survey of different formulations). In the canonical stochastic MAB problem, a decision maker (DM) pulls one of K arms sequentially at each time t ∈ {1, 2, ...}, and receives a random payoff drawn according to an arm-dependent distribution.…”
Section: Introduction (mentioning)
confidence: 99%
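As a concrete illustration of the canonical stochastic MAB setting described in this excerpt, the sketch below runs Beta-Bernoulli Thompson Sampling, the heuristic credited to Thompson (1933) and analyzed by Agrawal & Goyal (2012). The helper name, the simulated arm means, and the horizon are our own assumptions, chosen only for illustration.

```python
import numpy as np

def thompson_sampling(arm_means, horizon, rng=None):
    """Hypothetical sketch: Beta-Bernoulli Thompson Sampling on K arms."""
    rng = rng or np.random.default_rng()
    k = len(arm_means)
    successes = np.zeros(k)  # observed rewards equal to 1, per arm
    failures = np.zeros(k)   # observed rewards equal to 0, per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta(1 + s, 1 + f) posterior
        samples = rng.beta(1 + successes, 1 + failures)
        arm = int(np.argmax(samples))                  # pull the arm with the largest draw
        reward = float(rng.random() < arm_means[arm])  # Bernoulli payoff from that arm
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: three arms with means 0.3, 0.5, 0.7 over a horizon of 1000 pulls
print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000))
```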
“…We believe our results may also present new design considerations, in particular, how to achieve, loosely speaking, the "best of both worlds" for Thompson Sampling, by addressing its "small gap" instability. Lastly, we note that our proof techniques are markedly different from the conventional methodology adopted in MAB literature, e.g., Audibert, Munos & Szepesvári (2009), Bubeck & Cesa-Bianchi (2012), Agrawal & Goyal (2017), and may be of independent interest in the study of related learning algorithms.…”
Section: Introduction (mentioning)
confidence: 99%