2012
DOI: 10.48550/arxiv.1204.5721
Preprint

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Cited by 98 publications (144 citation statements). References 0 publications.

“…Brute force Dynamic programming (Bf) Here, we do not describe a policy U = {U^a_t}_{a∈A, t∈{0,…,T−1}}, but an algorithm Bf to compute V_0(π_0) in (3). Solving the maximization problem (3), that is, computing V_0(π_0) for a given prior (like, for instance, the uniform law given by the beta distribution β(1, 1) for all arms) can be done using Dynamic programming on the equivalent formulation (8). This is however only possible for relatively small instances of problem (3), that is, for a limited number |A| of arms and a limited time horizon T.…”
Section: Algorithms Tested (mentioning)
confidence: 99%
“…Solving the maximization problem (3), that is, computing V_0(π_0) for a given prior (like, for instance, the uniform law given by the beta distribution β(1, 1) for all arms) can be done using Dynamic programming on the equivalent formulation (8). This is however only possible for relatively small instances of problem (3), that is, for a limited number |A| of arms and a limited time horizon T. We recall here that solving the problem for |A| arms requires solving a Bellman equation with a state of dimension 2|A| (a state described by two integers per arm), which implies an exponential increase in computational cost with respect to |A|.…”
Section: Algorithms Tested (mentioning)
confidence: 99%
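To make the scale of this brute-force computation concrete, below is a minimal Python sketch of a Bayes-adaptive dynamic program for a Bernoulli bandit with independent uniform β(1, 1) priors. The function name and recursion layout are our own illustration, not code from the cited paper; the state is the tuple of (successes, failures) counts per arm, i.e. 2|A| integers, which is why the cost grows so quickly with the number of arms and the horizon.

```python
from functools import lru_cache

def bayes_optimal_value(num_arms, horizon):
    """Hypothetical sketch: Bayes-optimal value V_0(pi_0) of a Bernoulli
    bandit under independent Beta(1, 1) priors, computed by brute-force
    backward recursion over the (successes, failures) counts of every arm."""

    @lru_cache(maxsize=None)
    def value(t, state):
        # state: tuple of (successes, failures) pairs, one per arm
        if t == horizon:
            return 0.0
        best = 0.0
        for a, (s, f) in enumerate(state):
            # Posterior mean of arm a under its Beta(1 + s, 1 + f) posterior
            p = (1 + s) / (2 + s + f)
            win = state[:a] + ((s + 1, f),) + state[a + 1:]
            lose = state[:a] + ((s, f + 1),) + state[a + 1:]
            # Expected immediate reward plus optimal continuation value
            q = p * (1.0 + value(t + 1, win)) + (1 - p) * value(t + 1, lose)
            best = max(best, q)
        return best

    return value(0, tuple((0, 0) for _ in range(num_arms)))

# Only feasible for small instances, e.g. two arms and a short horizon:
print(bayes_optimal_value(num_arms=2, horizon=10))
```

The memoized recursion makes the exponential blow-up visible: the number of reachable states grows roughly like T^(2|A|), so adding even one arm multiplies the work substantially.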
“…This has origins in clinical trial studies dating back to 1933 (Thompson 1933), which gave rise to the earliest known MAB heuristic, Thompson Sampling (see Agrawal & Goyal (2012)). Today, the MAB problem manifests itself in various forms, with applications ranging from dynamic pricing and online auctions to packet routing, scheduling, e-commerce and matching markets, among others (see Bubeck & Cesa-Bianchi (2012) for a comprehensive survey of different formulations). In the canonical stochastic MAB problem, a decision maker (DM) pulls one of K arms sequentially at each time t ∈ {1, 2, ...}, and receives a random payoff drawn according to an arm-dependent distribution.…”
Section: Introduction (mentioning)
confidence: 99%
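As a concrete illustration of the canonical stochastic MAB setting described in this excerpt, the sketch below runs Beta-Bernoulli Thompson Sampling, the heuristic credited to Thompson (1933) and analyzed by Agrawal & Goyal (2012). The helper name, the simulated arm means, and the horizon are our own assumptions, chosen only for illustration.

```python
import numpy as np

def thompson_sampling(arm_means, horizon, rng=None):
    """Hypothetical sketch: Beta-Bernoulli Thompson Sampling on K arms."""
    rng = rng or np.random.default_rng()
    k = len(arm_means)
    successes = np.zeros(k)  # observed rewards equal to 1, per arm
    failures = np.zeros(k)   # observed rewards equal to 0, per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta(1 + s, 1 + f) posterior
        samples = rng.beta(1 + successes, 1 + failures)
        arm = int(np.argmax(samples))                  # pull the arm with the largest draw
        reward = float(rng.random() < arm_means[arm])  # Bernoulli payoff from that arm
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: three arms with means 0.3, 0.5, 0.7 over a horizon of 1000 pulls
print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000))
```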
“…We believe our results may also present new design considerations, in particular, how to achieve, loosely speaking, the "best of both worlds" for Thompson Sampling, by addressing its "small gap" instability. Lastly, we note that our proof techniques are markedly different from the conventional methodology adopted in MAB literature, e.g., Audibert, Munos & Szepesvári (2009), Bubeck & Cesa-Bianchi (2012), Agrawal & Goyal (2017), and may be of independent interest in the study of related learning algorithms.…”
Section: Introduction (mentioning)
confidence: 99%