2020
DOI: 10.48550/arxiv.2003.01704
Preprint

Model Selection in Contextual Stochastic Bandit Problems

Abstract: We study model selection in stochastic bandit problems. Our approach relies on a master algorithm that selects its actions among candidate base algorithms. While this problem has been studied for specific classes of stochastic base algorithms, our objective is to provide a method that can work with more general classes of stochastic base algorithms. We propose a master algorithm inspired by CORRAL (Agarwal et al., 2017) and introduce a novel and generic smoothing transformation for stochastic bandit algorithms that …
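
To make the master/base structure described in the abstract concrete, below is a minimal, hypothetical Python sketch of such an aggregation loop. The `base_algorithms` objects, their `choose`/`update` interface, and the EXP3-style exponential-weights update are all assumptions made for illustration; the paper's actual master follows CORRAL's log-barrier mirror-descent update and additionally applies its smoothing transformation to the base algorithms, neither of which is reproduced here.

```python
import numpy as np

class MasterOverBases:
    """Hypothetical sketch of a master/base aggregation loop: each round the
    master samples one base bandit algorithm, plays its action, and updates
    itself with an importance-weighted loss. CORRAL proper uses a log-barrier
    online-mirror-descent update with increasing learning rates; the
    EXP3-style exponential-weights update below only keeps the sketch short."""

    def __init__(self, base_algorithms, eta=0.1):
        self.bases = base_algorithms               # assumed to expose choose()/update()
        self.eta = eta
        self.log_w = np.zeros(len(base_algorithms))

    def _probs(self):
        w = np.exp(self.log_w - self.log_w.max())  # numerically stable softmax over bases
        return w / w.sum()

    def step(self, context, play):
        p = self._probs()
        i = np.random.choice(len(self.bases), p=p)   # sample a base algorithm
        action = self.bases[i].choose(context)       # the sampled base picks the action
        loss = play(action)                          # environment returns a loss in [0, 1]
        self.bases[i].update(context, action, loss)  # only the sampled base observes feedback
        self.log_w[i] -= self.eta * loss / p[i]      # importance weighting keeps estimates unbiased
        return action, loss
```

A caller would instantiate this with concrete base bandit algorithms and invoke `step` once per round, passing the current context and a callback that plays the chosen action and returns the observed loss.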

Cited by 12 publications (36 citation statements)
References 2 publications
“…Existing approaches to address either the stronger Objective 1 [9] or the weaker Objective 2 [16] make restrictive assumptions regarding the conditioning (what we will call diversity) of the contexts. Other, more data-agnostic approaches [3,32,31,28] achieve neither of the above objectives. This leads us to ask whether we can design a universal model selection approach that is data-agnostic (other than requiring a probability model on the contexts) and achieves either Objective 1 or 2.…”
Section: Introduction (mentioning)
confidence: 99%
“…[14] introduce a new family of algorithms that require access to an online oracle for square-loss regression and address the case of adversarial contexts. Concurrent work of [33] solves the case where contexts / action sets are stochastic. Both works ([14] and [33]) leverage CORRAL-type aggregation [2] of contextual bandit algorithms and achieve the optimal Õ(√d·Tε + d√T) regret bound.…”
Section: Introduction (mentioning)
confidence: 99%
“…Concurrent work of [33] solves the case where contexts / action sets are stochastic. Both works ([14] and [33]) leverage CORRAL-type aggregation [2] of contextual bandit algorithms and achieve the optimal Õ(√d·Tε + d√T) regret bound. Finally, in [32], the authors present a practical master algorithm that plays base algorithms that come with a candidate regret bound that may not hold during all rounds.…”
Section: Introduction (mentioning)
confidence: 99%