Robust Risk-Averse Stochastic Multi-armed Bandits

Maillard, Odalric-Ambrym

doi:10.1007/978-3-642-40935-6_16

Cited by 34 publications

(25 citation statements)

References 23 publications

“…PROOF OF PROPOSITION 3.3. From the lower bound in (20) in the proof of Theorem 3.2, there exists c > 0 such that for all x in the range T ≤ x ≤ (1 − )T and T sufficiently large,…”

Section: 1 (mentioning)

confidence: 99%

The Fragility of Optimized Bandit Algorithms

Lin

2021

Preprint

Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the associated algorithms necessarily have the undesirable feature that the tail of the regret distribution behaves like that of a truncated Cauchy distribution. Furthermore, for p > 1, the p'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the number of sub-optimal arm plays. We show that optimized Thompson sampling and UCB bandit designs are also fragile, in the sense that when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays that make that arm appear to be sub-optimal, thereby causing the algorithm to sample a truly sub-optimal arm much more than would be optimal.

“…There is also a growing literature on risk-averse formulations of the MAB problem, with a non-comprehensive list being: [23,20,29,24,26,11,7,25,28,21,4,15]. As noted earlier, risk-averse formulations involve defining arm optimality using criteria other than the expected value.…”

mentioning

confidence: 99%

The Fragility of Optimized Bandit Algorithms

Lin

2021

Preprint

“…However, the performance guarantees were still within the risk-neutral framework (in terms of the loss in the expected total reward) under the assumption that the best action in terms of the mean value is also the best action in terms of the conditional value at risk. Logarithm of moment generating function was considered as a risk measure for bandit problems in [20] and high probability bounds on regret were obtained. We point out that the logarithm of the moment generating function reduces to mean-variance for a random variable with Gaussian distribution.…”

Section: Related Work (mentioning)

confidence: 99%
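
The Gaussian reduction mentioned in the statement above can be checked in a few lines. This is a worked sketch, not taken from either paper: the symbols μ, σ², λ and κ are notation assumed here, and the sign convention (a negative λ for risk aversion over rewards) is one common choice.

```latex
% Assume X ~ N(\mu, \sigma^2). Its moment generating function is
\mathbb{E}\left[e^{\lambda X}\right] = \exp\!\left(\lambda\mu + \tfrac{1}{2}\lambda^{2}\sigma^{2}\right),
% so the normalized logarithm of the moment generating function is
\frac{1}{\lambda}\log \mathbb{E}\left[e^{\lambda X}\right] = \mu + \frac{\lambda}{2}\,\sigma^{2}.
% Writing \lambda = -\kappa with \kappa > 0 gives a mean-variance trade-off:
-\frac{1}{\kappa}\log \mathbb{E}\left[e^{-\kappa X}\right] = \mu - \frac{\kappa}{2}\,\sigma^{2}.
```

For non-Gaussian rewards the higher cumulants also enter, so the identity holds exactly only in the Gaussian case.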

Decision Variance in Risk-Averse Online Learning

Vakili, Boukouvalas, Zhao

2019

2019 IEEE 58th Conference on Decision and Control (CDC)

Online learning has traditionally focused on the expected rewards. In this paper, a risk-averse online learning problem under the performance measure of the mean-variance of the rewards is studied. Both the bandit and full information settings are considered. The performance of several existing policies is analyzed, and new fundamental limitations on risk-averse learning are established. In particular, it is shown that although a logarithmic distribution-dependent regret in time T is achievable (similar to the risk-neutral problem), the worst-case (i.e., minimax) regret is lower bounded by Ω(T) (in contrast to the Ω(√T) lower bound in the risk-neutral problem). This sharp difference from the risk-neutral counterpart is caused by the variance in the player's decisions, which, while absent in the regret under the expected reward criterion, contributes to excess mean-variance due to the non-linearity of this risk measure. The role of the decision variance in regret performance reflects a risk-averse player's desire for robust decisions and outcomes.
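
The effect of decision variance described in the abstract above can be illustrated with a small numeric sketch. It assumes the common convention that mean-variance is the variance minus `rho` times the mean (lower is better); the function name `empirical_mean_variance`, the parameter `rho`, and the toy reward sequences are all hypothetical and not taken from the paper.

```python
from statistics import fmean, pvariance


def empirical_mean_variance(rewards, rho=1.0):
    """Empirical mean-variance of a reward sequence.

    Assumed convention: variance - rho * mean, so lower values are better.
    """
    return pvariance(rewards) - rho * fmean(rewards)


# Two deterministic arms (hypothetical): each has zero reward variance.
arm_a = [2, 2, 2, 2]
arm_b = [0, 0, 0, 0]

# A player who alternates between the arms observes this reward sequence.
alternating = [0, 2, 0, 2]

mv_a = empirical_mean_variance(arm_a)          # 0 - 2 = -2
mv_b = empirical_mean_variance(arm_b)          # 0 - 0 = 0
mv_mix = empirical_mean_variance(alternating)  # 1 - 1 = 0

# The average of the two arms' mean-variances is -1, yet the alternating
# sequence scores 0: the gap of 1 is variance created purely by the decisions.
excess = mv_mix - (mv_a + mv_b) / 2
```

Under the expected-reward criterion the same alternation would simply average the two means, which is why this excess term has no risk-neutral counterpart.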

“…Other risk-averse MAB papers also considered the CVaR. Upper confidence bound algorithms in this context are studied by Maillard (2013), Cassel et al (2018), Khajonchotpanya et al (2021). Alternative arm selection approaches in the context of risk-averse bandits include the max-min approach discussed in Galichet et al (2013), the successive rejects relying on concentration bound guarantees of Kolla et al (2019a), robust estimation-based algorithms in Kagrecha et al (2020), or Thompson Sampling approaches in Chang et al (2020) and Baudry et al (2021).…”

Section: Introduction (mentioning)

confidence: 99%

Risk averse non-stationary multi-armed bandits

Benac, Godin

2021

Preprint

This paper tackles the risk-averse multi-armed bandit problem when incurred losses are non-stationary. The conditional value-at-risk (CVaR) is used as the objective function. Two estimation methods are proposed for this objective function in the presence of non-stationary losses, one relying on a weighted empirical distribution of losses and another on the dual representation of the CVaR. Such estimates can then be embedded into classic arm selection methods such as ε-greedy policies. Simulation experiments assess the performance of the arm selection algorithms based on the two novel estimation approaches, and such policies are shown to outperform naive benchmarks not taking non-stationarity into account.
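
A CVaR estimate built on a weighted empirical distribution of losses, as mentioned in the abstract above, might look like the following minimal sketch. It is not the authors' estimator: the function names `discount_weights` and `weighted_cvar`, the exponential discounting scheme, and the parameter `gamma` are assumptions chosen only for illustration.

```python
def discount_weights(n, gamma=0.9):
    """Hypothetical weighting scheme: the newest of n losses gets weight 1,
    and each older loss is discounted by a further factor of gamma."""
    return [gamma ** (n - 1 - i) for i in range(n)]


def weighted_cvar(losses, alpha=0.95, weights=None):
    """CVaR at level alpha of a weighted empirical loss distribution.

    Averages the worst (1 - alpha) share of probability mass, splitting the
    boundary atom if needed. Uniform weights give the plain empirical CVaR.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if weights is None:
        weights = [1.0] * len(losses)
    total = float(sum(weights))
    tail_mass = 1.0 - alpha
    remaining = tail_mass
    acc = 0.0
    # Walk through the losses from largest to smallest, consuming tail mass.
    for loss, weight in sorted(zip(losses, weights), reverse=True):
        mass = min(weight / total, remaining)
        acc += mass * loss
        remaining -= mass
        if remaining <= 0.0:
            break
    return acc / tail_mass


# Usage: recent losses count more when the loss distribution drifts.
history = [1.0, 2.0, 3.0, 4.0]
estimate = weighted_cvar(history, alpha=0.5, weights=discount_weights(len(history)))
```

An ε-greedy policy could then pick, with probability 1 - ε, the arm whose estimate is smallest and explore uniformly otherwise; that selection step is left out here to keep the sketch short.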
