2015
DOI: 10.48550/arxiv.1504.05823
Preprint

Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem

Cited by 7 publications (11 citation statements) | References 0 publications
“…As such, this section essentially reproduces the result of Cowan et al [14] (presented therein in terms of classical regret) in the framework established herein. In this case the controller is interested in activating the bandit with maximum expected value as often as possible.…”
Section: Unknown Means and Unknown Variances: Maximizing Expected Value (supporting)
confidence: 74%
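The setting described above — activating the bandit with the maximum expected value as often as possible, when both means and variances are unknown — can be illustrated with a small simulation. The sketch below uses a generic variance-aware upper-confidence index in the spirit of UCB1-Normal; it is an illustration of the problem setting only, not the exact index policy of Cowan et al.

```python
import math
import random

def run(means, sds, horizon, seed=0):
    """Simulate a variance-aware UCB sketch on normal bandits with
    unknown means and variances. Illustrative only; the index below
    is a UCB1-Normal-style stand-in, not the paper's policy."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    sumsq = [0.0] * k

    def pull(i):
        x = rng.gauss(means[i], sds[i])
        counts[i] += 1
        sums[i] += x
        sumsq[i] += x * x

    # Sample each arm twice so an empirical variance exists.
    for i in range(k):
        pull(i)
        pull(i)

    for t in range(2 * k, horizon):
        best, best_idx = None, 0.0
        for i in range(k):
            n = counts[i]
            mean = sums[i] / n
            # Unbiased sample variance, floored to avoid a zero bonus.
            var = max((sumsq[i] - n * mean * mean) / (n - 1), 1e-12)
            index = mean + math.sqrt(16.0 * var * math.log(t) / n)
            if best is None or index > best_idx:
                best, best_idx = i, index
        pull(best)
    return counts

counts = run(means=[0.0, 1.0], sds=[1.0, 1.0], horizon=2000)
# The arm with the larger mean should receive most activations,
# with sub-optimal activations growing only slowly with the horizon.
```

Because the variance is estimated from data rather than assumed known, the exploration bonus adapts to each arm's observed spread, which is the feature that distinguishes this model from the known-variance normal bandit.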
“…These policies form the basis for deriving logarithmic regret policies for more general models, cf. Auer et al (2002), Auer and Ortner (2010), Cowan et al (2015), Cowan and Katehakis (2015a).…”
Section: Introduction (mentioning)
confidence: 99%
“…Policies that achieve this minimal asymptotic growth rate have been derived for specific parametric models in Lai and Robbins [9], Burnetas and Katehakis [4], Honda and Takemura [7], Honda and Takemura [6], Honda and Takemura [8], Cowan et al [5] and references therein. In general it is not always easy to obtain such optimal policies; thus, policies that satisfy the less strict requirement of Eq.…”
Section: Related Literature (mentioning)
confidence: 99%
“…In such instances, we may in fact conclude from the results presented herein, and standard results relating modes of convergence, that for the policies constructed here, for g(n) = O(ln n), the sequences of random variables R_{π_g^F}(n)/g(n), R_{π_g^O}(n)/g(n) are not uniformly integrable. An example as to how this can occur is given via the proof of Theorem 2 of Cowan et al [5], where, with a non-trivial probability, non-representative initial sampling of each bandit biases expected future activations of sub-optimal bandits super-logarithmically. This effect does not influence the long-term almost sure behavior of these policies.…”
Section: Related Literaturementioning
confidence: 99%
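For reference, the standard notion invoked in the statement above: a family of random variables (X_n) is uniformly integrable when

```latex
% Uniform integrability of a family (X_n):
\sup_n \, \mathbb{E}\bigl[\, |X_n| \, \mathbf{1}\{|X_n| > M\} \,\bigr] \;\longrightarrow\; 0
\qquad \text{as } M \to \infty .
```

Almost-sure convergence together with uniform integrability would yield convergence of expectations; its failure here is precisely why the almost-sure logarithmic rate of R(n)/g(n) need not transfer to the expected-regret rate.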