A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms
Preprint, 2021 | DOI: 10.48550/arxiv.2106.02126

Abstract: One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between the mean rewards of the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapt to this gap: for a horizon of play n, it achieves optimal O(log n) regret in instances with "large" gaps, and a near-optimal O(√(n log n)) minimax regret when the gap can be arbitrarily "small." This…
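For context, the UCB policy named in the abstract assigns each arm an optimistic index: its empirical mean reward plus an exploration bonus that shrinks as the arm is sampled more often. The sketch below is a minimal, standard implementation of the UCB1 index rule (Auer et al., 2002a), not code from the paper under discussion; the `pull(arm)` reward function and the two-armed Bernoulli example are illustrative assumptions.

```python
# Minimal sketch of the standard UCB1 index policy (Auer et al., 2002a).
# Context only -- this is not code from the paper under discussion.
# `pull(arm)` is a hypothetical user-supplied reward function.
import math
import random

def ucb1(pull, num_arms, horizon):
    counts = [0] * num_arms   # number of plays per arm
    means = [0.0] * num_arms  # empirical mean reward per arm
    history = []
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # play each arm once to initialize
        else:
            # Optimism in the face of uncertainty: pick the arm whose
            # empirical mean plus exploration bonus is largest.
            arm = max(range(num_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running average
        history.append((arm, r))
    return history

# Illustrative example: two Bernoulli arms with an instance gap of 0.2.
if __name__ == "__main__":
    arm_means = [0.5, 0.7]
    random.seed(0)
    plays = ucb1(lambda a: 1.0 if random.random() < arm_means[a] else 0.0,
                 num_arms=2, horizon=1000)
```

The sqrt(2 log t / counts[i]) bonus is what makes the policy gap-adaptive: suboptimal arms with a "large" gap are quickly starved of plays, which is the mechanism behind the O(log n) instance-dependent regret the abstract refers to.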

Cited by 2 publications (2 citation statements) | References 18 publications

Citation statements (ordered by relevance):
“…In contrast, the recently introduced countable-armed bandit (CAB) problem (Kalvit and Zeevi, 2020; de Heide et al., 2021) that our work is most closely related to is fundamentally simpler owing to a finite set of arm-types; this is central to the achievability of logarithmic regret in the CAB problem. Kalvit and Zeevi (2020) study the CAB problem with K = 2 types, and propose an online adaptive algorithm that achieves an expected cumulative regret of O(log n) after any number of plays n. Notably, their algorithm does not require ex ante knowledge of (a lower bound on) the fraction α of "optimal" arms in the arm-reservoir, and its regret analysis relies purely on certain novel properties of the UCB algorithm (Auer et al., 2002a); these properties are elucidated in Kalvit and Zeevi (2021). The general CAB problem with multiple arm-types was subsequently studied in de Heide et al. (2021), where it was shown that achieving logarithmic regret absent ex ante knowledge of α is impossible if K > 2.…”
Section: Related Literature | Citation type: mentioning | Confidence: 99%
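For reference, the quantity bounded by the O(log n) guarantees quoted above is the standard expected cumulative regret, sketched below in conventional notation; the symbols μ* and A_t are generic and not drawn from the quoted statement.

```latex
% Expected cumulative regret after n plays: \mu^* is the largest mean
% reward achievable per play, and A_t is the arm played at time t.
% Conventional notation, not taken from the quoted statement.
R_n = n\,\mu^* - \mathbb{E}\left[ \sum_{t=1}^{n} \mu_{A_t} \right]
```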
“…In [27, 10, 14], diffusion approximations are given for TS and UCB algorithms under scaling regimes where the gaps between arm means shrink with the time horizon in a worst-case way. Although distributional characterizations of regret are obtained in these works (in terms of solutions to stochastic differential equations and random ordinary differential equations), the worst-case scaling regimes considered make the results incomparable to those in this paper.…”
Citation type: mentioning | Confidence: 99%