Multi-Armed Bandits Under General Depreciation and Commitment

Cowan, Wesley; Katehakis, Michael N.

doi:10.1017/s0269964814000217

Cited by 16 publications

(15 citation statements)

References 53 publications

“…For other work in this area we refer to Katehakis and Derman [30], Katehakis and Veinott Jr [32], Burnetas and Katehakis [8], Burnetas and Katehakis [9], Lagoudakis and Parr [35], Bartlett and Tewari [5], Tekin and Liu [44], Jouini et al [29], Dayanik, Powell, and Yamazaki [20], Filippi, Cappé, and Garivier [24], Osband and Van Roy [41]. As well as Burnetas and Katehakis [13], Audibert et al [1], Auer and Ortner [3], Gittins, Glazebrook, and Weber [25], Bubeck and Slivkins [6], Cappé et al [15], Kaufmann [33], Li, Munos, and Szepesvári [38], Cowan and Katehakis [17], Cowan and Katehakis [19], and references therein. For dynamic programming extensions we refer to Burnetas and Katehakis [11], Butenko, Pardalos, and Murphey [14], Tewari and Bartlett [45], Audibert et al [1], Littman [39], Feinberg, Kasyanov, and Zgurovsky [22] and references therein.…”

Section: (13) (mentioning)

confidence: 99%

Asymptotically Optimal Multi-Armed Bandit Policies Under a Cost Constraint

Burnetas

Kanavetas

Katehakis

2016

Prob. Eng. Inf. Sci.

Self Cite

We develop asymptotically optimal policies for the multi-armed bandit (MAB) problem under a cost constraint. This model is applicable in situations where each sample (or activation) from a population (bandit) incurs a known bandit-dependent cost. Successive samples from each population are iid random variables with unknown distribution. The objective is to design a feasible policy for deciding which population to sample from, so as to maximize the expected sum of outcomes of n total samples or, equivalently, to minimize the regret due to lack of information on sample distributions. For this problem we consider the class of feasible uniformly fast (f-UF) convergent policies that satisfy the cost constraint sample-path wise. We first establish a necessary asymptotic lower bound for the rate of increase of the regret function of f-UF policies. Then we construct a class of f-UF policies and provide conditions under which they are asymptotically optimal within the class of f-UF policies, achieving this asymptotic lower bound. Finally, we provide the explicit form of such policies for the case in which the unknown distributions are Normal with unknown means and known variances.

“…This effect does not influence the long-term almost sure behavior of these policies. For other significant related recent work, we refer to Garivier et al [9], Lattimore [14], Ortner [16], Orabona and Pál [15], Cowan and Katehakis [5][6][7].…”

Section: Related Literature (mentioning)

confidence: 99%

Exploration–exploitation Policies With Almost Sure, Arbitrarily Slow Growing Asymptotic Regret

Cowan

Katehakis

2019

Prob. Eng. Inf. Sci.

Self Cite

The purpose of this paper is to provide further understanding of the structure of the sequential allocation ("stochastic multi-armed bandit", or MAB) problem by establishing probability one finite horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies π. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing and the g-Inflated Sample Mean) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function g effectively controls the "exploration" of the classical "exploration/exploitation" tradeoff.
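To make the role of g concrete, here is a minimal Python sketch of a generic forced-exploration scheme. It is an illustration under stated assumptions, not the paper's exact g-Forcing construction: the function name `forcing_policy`, the default g(t) = log(t + 1), the unit-variance Normal rewards, and the tie-breaking rules are all hypothetical choices. An arm is sampled whenever its sample count has fallen below g(t); otherwise the arm with the highest sample mean is played. The pseudo-regret is computed in the standard way, as the sum over arms of the mean gap times the number of pulls.

```python
import math
import random


def forcing_policy(means, n, g=lambda t: math.log(t + 1), seed=0):
    """Run a generic forced-exploration policy for n rounds.

    means: true arm means (unknown to the policy, used only to simulate
           rewards and to score the pseudo-regret afterwards).
    g:     slowly increasing function controlling forced exploration
           (hypothetical default; any slowly growing g could be used).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of times each arm has been sampled
    sums = [0.0] * k   # running sum of observed rewards per arm

    for t in range(1, n + 1):
        # Every arm must have at least max(1, g(t)) samples by round t.
        threshold = max(1.0, g(t))
        under = [i for i in range(k) if counts[i] < threshold]
        if under:
            # Exploration: sample the least-sampled under-explored arm.
            arm = min(under, key=lambda i: counts[i])
        else:
            # Exploitation: play the arm with the highest sample mean.
            arm = max(range(k), key=lambda i: sums[i] / counts[i])
        reward = rng.gauss(means[arm], 1.0)  # assumed unit-variance Normal
        counts[arm] += 1
        sums[arm] += reward

    best = max(means)
    # Pseudo-regret: mean gap of each arm times the number of times it was played.
    pseudo_regret = sum((best - m) * c for m, c in zip(means, counts))
    return counts, pseudo_regret


if __name__ == "__main__":
    counts, regret = forcing_policy([0.0, 1.0], n=2000)
    print(counts, regret)
```

A faster-growing g forces more samples from every arm, including suboptimal ones, which is the sense in which g sets the exploration rate in this sketch.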

“…Hence, we can invoke the restart problem introduced in Katehakis and Veinott, Jr. [17] and Cowan and Katehakis [8] to compute the robust indices. Indeed, one can show that for a fixed initial…”

Section: Proposition 2. The Robust Gittins Index Is Given By (mentioning)

confidence: 99%

Robust control of the multi-armed bandit problem

Caro

Gupta

2015

Ann Oper Res

We study a robust model of the multi-armed bandit (MAB) problem in which the transition probabilities are ambiguous and belong to subsets of the probability simplex. We first show that for each arm there exists a robust counterpart of the Gittins index that is the solution to a robust optimal stopping-time problem and can be computed effectively with an equivalent restart problem. We then characterize the optimal policy of the robust MAB as a project-by-project retirement policy but we show that arms become dependent so the policy based on the robust Gittins index is not optimal. For a project selection problem, we show that the robust Gittins index policy is near optimal but its implementation requires more computational effort than solving a non-robust MAB problem. Hence, we propose a Lagrangian index policy that requires the same computational effort as evaluating the indices of a non-robust MAB and is within 1% of the optimum in the robust project selection problem.
