2018
DOI: 10.48550/arXiv.1807.07623
Preprint
Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits

Abstract: We derive an algorithm that achieves the optimal (up to constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent with Tsallis entropy regularizer. We provide a complete characterization of such algorithms and show that Tsallis entropy with power α = 1/2 achieves the goal. In addition, the proposed algorithm enjoys improved regret guarantees in two intermediate regimes: the moderately c…
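To make the abstract's construction concrete, here is a minimal sketch of online mirror descent with the 1/2-Tsallis entropy regularizer. With power α = 1/2, the sampling distribution admits the parametrization w_i = 4 / (η (L_i − x))², where L is the vector of cumulative importance-weighted losses and x is a normalizer chosen so the weights sum to one. The bisection solver, the exact constant in the anytime learning-rate schedule, and the Bernoulli test environment below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def tsallis_inf_weights(L, eta):
    """Sampling distribution of OMD with 1/2-Tsallis entropy:
    w_i = 4 / (eta * (L_i - x))^2, with x < min(L) chosen so sum(w) = 1."""
    K = len(L)
    lo = L.min() - 2.0 * np.sqrt(K) / eta   # at lo: every w_i <= 1/K, so sum(w) <= 1
    hi = L.min() - 2.0 / eta                # at hi: the best arm alone has w = 1
    for _ in range(60):                     # bisection on the normalizer x
        x = 0.5 * (lo + hi)
        if (4.0 / (eta * (L - x)) ** 2).sum() > 1.0:
            hi = x
        else:
            lo = x
    w = 4.0 / (eta * (L - x)) ** 2
    return w / w.sum()

# Hypothetical 2-armed stochastic bandit with Bernoulli losses.
K, T = 2, 2000
means = np.array([0.3, 0.5])
L = np.zeros(K)                             # cumulative importance-weighted losses
for t in range(1, T + 1):
    eta = 2.0 / np.sqrt(t)                  # anytime 1/sqrt(t)-type schedule (constant assumed)
    w = tsallis_inf_weights(L, eta)
    arm = rng.choice(K, p=w)
    loss = float(rng.random() < means[arm])
    L[arm] += loss / w[arm]                 # importance-weighted loss estimator
```

The same loop runs unchanged in an adversarial regime (replace the Bernoulli draw with arbitrary losses in [0, 1]); the point of the paper is that no regime detection or horizon knowledge is needed.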

Cited by 6 publications (11 citation statements)
References 4 publications
“…We are now ready to show the main result of this section. Our proof follows the techniques of (Shamir and Zhang, 2013) combined with the analysis of FTRL in (Abernethy et al., 2015; Zimmert and Seldin, 2018). Let u ∈ W be a fixed vector in the convex set to be chosen later.…”
Section: A Last Iterate Convergence of FTRL
confidence: 98%
“…The proof of Theorem 5.1 follows the ideas for the analysis of FTRL in Abernethy et al. (2015) and Zimmert and Seldin (2018), and combines them with the techniques used to obtain last iterate guarantees for stochastic gradient descent in Shamir and Zhang (2013). Since the analysis is somewhat standard, we are going to devote the rest of the section to the privacy analysis.…”
Section: Proof Techniques
confidence: 99%
“…Lykouris et al [LMPL18] introduced a variant of the standard stochastic multi-armed bandit problem, where an adversary can corrupt a number of samples, and provided algorithms with learning rates that degrade according to the number of corruptions. The guarantees for stochastic multi-armed bandits were subsequently strengthened by Gupta et al [GKT19] and Zimmert and Seldin [ZS19], and the concept of adversarial corruptions has also been extended to several other settings including dynamic assortment optimization [CKW19], linear bandits [LLS19] and reinforcement learning [LSSS19]. Our work differs from these in that we use adversarial corruptions as a modeling tool to capture arbitrarily irrational agent behavior in game-theoretic settings.…”
Section: Related Work
confidence: 99%
“…Bridging between the stochastic and adversarial settings in multi-armed bandits, and in online learning more generally, has been a topic of significant interest in recent years. Most research in this direction has focused on obtaining "the best of both worlds" guarantees (Bubeck and Slivkins, 2012; Seldin and Slivkins, 2014; Auer and Chiang, 2016; Seldin and Lugosi, 2017; Zimmert and Seldin, 2018; Zimmert et al., 2019). The goal there is to achieve the better of the bounds at the two extremes: the worst-case O(√T)-type bound on any problem instance, and the better O(log T)-type bound whenever the instance is actually stochastic.…”
Section: Related Work
confidence: 99%
“…Adversarial contaminations similar to those considered here have been studied before in the context of bandit problems. Seldin and Slivkins (2014) and Zimmert and Seldin (2018) consider a "moderately contaminated" regime in which the adversarial corruptions do not reduce the gap min_{i ≠ i⋆} ∆_i by more than a constant factor at any point in time. This regime of contamination is very restrictive and, for example, precludes virtually any form of corruption in the early stages of learning.…”
Section: Related Work
confidence: 99%