1997
DOI: 10.1287/moor.22.1.222

Optimal Adaptive Policies for Markov Decision Processes

Abstract: In this paper we consider the problem of adaptive control for Markov Decision Processes. We give the explicit form for a class of adaptive policies that possess optimal increase rate properties for the total expected finite horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of the proposed policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the es…
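To make the construction described in the abstract concrete, the sketch below shows one generic way an index-based adaptive policy can be organized: at each visited state, choose the action maximizing an inflated estimate of the right-hand side of the average-reward optimality equation under the empirical model. This is a minimal illustration, not the paper's exact policy; the square-root bonus term, the bias vector `h`, and all function and variable names are assumptions introduced here for illustration only.

```python
import numpy as np

def index_action(s, t, r, counts, h):
    """Minimal sketch (not the paper's exact policy): at state s and
    time t, pick the action maximizing an inflated estimate of
    r(s, a) + sum_j p_hat(j | s, a) * h(j), i.e. the right-hand side
    of the average-reward optimality equation under the empirical model.

    r      : rewards, shape (S, A)                 -- assumed known
    counts : observed transition counts, shape (S, A, S)
    h      : current bias (relative value) estimate, shape (S,)
    """
    n_actions = r.shape[1]
    best_a, best_index = 0, -np.inf
    for a in range(n_actions):
        n_sa = counts[s, a].sum()
        if n_sa == 0:
            return a                       # try every action at least once
        p_hat = counts[s, a] / n_sa        # empirical transition law at (s, a)
        estimate = r[s, a] + p_hat @ h     # estimated optimality-equation RHS
        bonus = np.sqrt(2.0 * np.log(t + 1) / n_sa)  # generic optimism inflation (assumed form)
        if estimate + bonus > best_index:
            best_a, best_index = a, estimate + bonus
    return best_a
```

In a full implementation, `h` would be recomputed from the empirical model as learning proceeds; the specific inflation that yields the optimal rate is derived in the paper, whereas the square-root bonus above is only a familiar stand-in.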


Citations: cited by 190 publications (137 citation statements)
References: 27 publications
“…In particular, assuming irreducibility of the transition matrices, an asymptotically logarithmic regret is possible (Burnetas and Katehakis, 1997; Tewari and Bartlett, 2008): L_A(T) ≤ C · ln T for some constant C. Unfortunately, the value of C depends on the dynamics of the underlying MDP and can in fact be arbitrarily large.…”
Section: Regret Minimization (mentioning)
confidence: 99%
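For orientation, the regret L_A(T) in the quoted bound is the cumulative shortfall of a learning policy A against the optimal long-run average reward g*; a logarithmic bound means this shortfall grows like ln T. A minimal sketch of the bookkeeping, with `g_star` assumed known purely for evaluation:

```python
def cumulative_regret(rewards, g_star):
    """L_A(T) = T * g_star - (total reward collected by policy A in T steps).
    `rewards` is the sequence of rewards earned by the learning policy;
    `g_star` is the optimal long-run average reward (assumed known here
    only for evaluation).  The quoted bound states L_A(T) <= C * ln(T)
    for irreducible MDPs, with C depending on the MDP's dynamics.
    """
    return len(rewards) * g_star - sum(rewards)
```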
“…Katehakis and Robbins [32], Burnetas et al [7], Burnetas and Katehakis [8], Ortner and Auer [40], Oksanen et al [39]. For other related work we refer to the following: Flint et al [17], Fernández-Gaucherand et al [15], Govindarajulu and Katehakis [25], Honda and Takemura [26], Tekin and Liu [47], Tewari and Bartlett [48], Filippi et al [16], Bertsekas [4], Bubeck and Cesa-Bianchi [5] and Burnetas et al [6].…”
Section: Introduction (mentioning)
confidence: 99%
“…These papers have a different objective than ours, as they focus on minimizing regret by constructing adaptive index policies that possess optimal increase rate properties. This approach has been extended to finite state and action MDPs with incomplete information (Burnetas and Katehakis [6]) and to adversarial bandits that either make no assumption whatsoever on the process generating the payoffs of the bandits (Auer et al. [1]) or bound its variation within a "variation budget" (Besbes et al. [4]). At the time of submission we became aware of the work of Kim and Lim [18], who also study the RMAB problem but with an alternative formulation in which deviations of the transition probabilities from their point estimates are penalized, so the analysis is essentially different from ours.…”
Section: Introduction (mentioning)
confidence: 99%