A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms
Preprint, 2021 | DOI: 10.48550/arxiv.2106.02126

Abstract: One of the key drivers of complexity in the classical (stochastic) multi-armed bandit (MAB) problem is the difference between the mean rewards of the top two arms, also known as the instance gap. The celebrated Upper Confidence Bound (UCB) policy is among the simplest optimism-based MAB algorithms that naturally adapt to this gap: for a horizon of play n, it achieves optimal O(log n) regret in instances with "large" gaps, and a near-optimal O(√(n log n)) minimax regret when the gap can be arbitrarily "small." This…
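For context, the UCB policy named in the abstract assigns each arm an optimistic index: its empirical mean reward plus an exploration bonus that shrinks as the arm is sampled more often. The sketch below is a minimal, standard implementation of the UCB1 index rule (Auer et al., 2002a), not code from the paper under discussion; the `pull(arm)` reward function and the two-armed Bernoulli example are illustrative assumptions.

```python
# Minimal sketch of the standard UCB1 index policy (Auer et al., 2002a).
# Context only -- this is not code from the paper under discussion.
# `pull(arm)` is a hypothetical user-supplied reward function.
import math
import random

def ucb1(pull, num_arms, horizon):
    counts = [0] * num_arms   # number of plays per arm
    means = [0.0] * num_arms  # empirical mean reward per arm
    history = []
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # play each arm once to initialize
        else:
            # Optimism in the face of uncertainty: pick the arm whose
            # empirical mean plus exploration bonus is largest.
            arm = max(range(num_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running average
        history.append((arm, r))
    return history

# Illustrative example: two Bernoulli arms with an instance gap of 0.2.
if __name__ == "__main__":
    arm_means = [0.5, 0.7]
    random.seed(0)
    plays = ucb1(lambda a: 1.0 if random.random() < arm_means[a] else 0.0,
                 num_arms=2, horizon=1000)
```

The sqrt(2 log t / counts[i]) bonus is what makes the policy gap-adaptive: suboptimal arms with a "large" gap are quickly starved of plays, which is the mechanism behind the O(log n) instance-dependent regret the abstract refers to.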

Cited by 2 publications (2 citation statements) | References 18 publications

Citation statements (ordered by relevance):
“…In contrast, the recently introduced countable-armed bandit (CAB) problem (Kalvit and Zeevi, 2020; de Heide et al., 2021) that our work is most closely related to is fundamentally simpler owing to a finite set of arm-types; this is central to the achievability of logarithmic regret in the CAB problem. Kalvit and Zeevi (2020) study the CAB problem with K = 2 types, and propose an online adaptive algorithm that achieves an expected cumulative regret of O(log n) after any number of plays n. Notably, their algorithm does not require ex ante knowledge of (a lower bound on) the fraction α of "optimal" arms in the arm-reservoir, and its regret analysis relies purely on certain novel properties of the UCB algorithm (Auer et al., 2002a); these properties are elucidated in Kalvit and Zeevi (2021). The general CAB problem with multiple arm-types was subsequently studied in de Heide et al. (2021), where it was shown that achieving logarithmic regret absent ex ante knowledge of α is impossible if K > 2.…”
Section: Related Literature | Citation type: mentioning | Confidence: 99%
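For reference, the quantity bounded by the O(log n) guarantees quoted above is the standard expected cumulative regret, sketched below in conventional notation; the symbols μ* and A_t are generic and not drawn from the quoted statement.

```latex
% Expected cumulative regret after n plays: \mu^* is the largest mean
% reward achievable per play, and A_t is the arm played at time t.
% Conventional notation, not taken from the quoted statement.
R_n = n\,\mu^* - \mathbb{E}\left[ \sum_{t=1}^{n} \mu_{A_t} \right]
```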
“…In [27, 10, 14], diffusion approximations are given for TS and UCB algorithms under scaling regimes where the gaps between arm means shrink with the time horizon in a worst-case way. Although distributional characterizations of regret are obtained in these works (in terms of solutions to stochastic differential equations and random ordinary differential equations), the worst-case scaling regimes considered make the results incomparable to those in this paper.…”
Citation type: mentioning | Confidence: 99%