Discrete-Time Controlled Markov Processes with Average Cost Criterion: A Survey

Arapostathis, Ari; Borkar, Vivek S.; Fernández-Gaucherand, E.; Ghosh, Mrinal K.; Marcus, Solomon

doi:10.1137/0331018

Cited by 529 publications

(305 citation statements)

References 156 publications

Citation statement types: Supporting, Mentioning (290), Contrasting, Unclassified


“…The literature on average cost MDPs is vast. Most of the earlier results are surveyed in Arapostathis et al [1]. Here, we mention just a few references.…”

mentioning

confidence: 93%

Average Cost Markov Decision Processes with Weakly Continuous Transition Probabilities

2012


This paper presents sufficient conditions for the existence of stationary optimal policies for average cost Markov decision processes with Borel state and action sets and weakly continuous transition probabilities. The one-step cost functions may be unbounded, and the action sets may be noncompact. The main contributions of this paper are: (i) general sufficient conditions for the existence of stationary discount optimal and average cost optimal policies and descriptions of properties of value functions and sets of optimal actions, (ii) a sufficient condition for the average cost optimality of a stationary policy in the form of optimality inequalities, and (iii) approximations of average cost optimal actions by discount optimal actions.

Key words: Markov decision process; average cost per unit time; optimality inequality; optimal policy

MSC2000 subject classification: Primary: 90C40; secondary: 90C39

OR/MS subject classification: Primary: dynamic programming/optimal control; secondary: Markov, infinite state

1. Introduction. This paper provides sufficient conditions for the existence of stationary optimal policies for average cost Markov decision processes (MDPs) with Borel state and action sets and weakly continuous transition probabilities. The cost functions may be unbounded, and the action sets may be noncompact. The main contributions of this paper are: (i) general sufficient conditions for the existence of stationary discount optimal and average cost optimal policies and descriptions of properties of value functions and sets of optimal actions (Theorems 1, 3, and 4), (ii) a new sufficient condition for average cost optimality based on optimality inequalities (Theorem 2), and (iii) approximations of average cost optimal actions by discount optimal actions (Theorem 5).

For infinite-horizon MDPs, there are two major criteria: average costs per unit time and expected total discounted costs. The former is typically more difficult to analyze. The so-called vanishing discount factor approach is often used to approximate average costs per unit time by normalized expected total discounted costs. The literature on average cost MDPs is vast. Most of the earlier results are surveyed in Arapostathis et al. [1]. Here, we mention just a few references.

For finite-state and action sets, Derman [10] proved the existence of stationary average cost optimal policies. This result follows from Blackwell [6] and it also was independently proved by Viskov and Shiryaev [31]. When either the state set or action set is infinite, even ε-optimal policies may not exist for some ε > 0; Ross [25], Dynkin and Yushkevich [11, Chapter 7], Feinberg [12, §5]. For a finite-state set and compact action sets, optimal policies may not exist; Bather [2], Chitashvili [9], and Dynkin and Yushkevich [11, Chapter 7].

For MDPs with finite-state and action sets, there exist stationary policies satisfying optimality equations (see Dynkin and Yushkevich [11, Chapter 7], where these equations are called canonical), and, furthermore, any stationary policy satisfy...
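As a rough illustration of the vanishing discount factor approach mentioned in the passage above, the following Python sketch compares the normalized discounted value (1 - β)·V_β with the optimal average cost on a tiny finite MDP. The transition matrices `P`, the costs `c`, the function names, and the tolerances are all hypothetical choices made for this example; none of them come from the cited papers, which treat far more general Borel-space models.

```python
# Hypothetical 2-state, 2-action MDP; every number below is an illustrative assumption.
# P[a][x][y]: probability of moving from state x to state y under action a.
# c[x][a]: one-step cost of taking action a in state x.
P = [
    [[0.9, 0.1], [0.4, 0.6]],  # action 0
    [[0.2, 0.8], [0.7, 0.3]],  # action 1
]
c = [[1.0, 3.0], [4.0, 2.0]]
STATES = range(2)
ACTIONS = range(2)


def bellman_min(v, beta=1.0):
    """Apply the cost-minimizing Bellman operator with discount factor beta."""
    return [
        min(c[x][a] + beta * sum(P[a][x][y] * v[y] for y in STATES) for a in ACTIONS)
        for x in STATES
    ]


def discounted_value(beta, tol=1e-10, max_iter=200_000):
    """Value iteration for the expected total beta-discounted cost."""
    v = [0.0, 0.0]
    for _ in range(max_iter):
        new = bellman_min(v, beta)
        if max(abs(new[x] - v[x]) for x in STATES) < tol:
            return new
        v = new
    return v


def average_cost(tol=1e-12, max_iter=100_000):
    """Relative value iteration, using state 0 as the reference state."""
    h = [0.0, 0.0]
    rho = 0.0
    for _ in range(max_iter):
        t = bellman_min(h)
        rho = t[0]  # current estimate of the optimal average cost
        new = [t[x] - rho for x in STATES]
        if max(abs(new[x] - h[x]) for x in STATES) < tol:
            break
        h = new
    return rho


rho = average_cost()
for beta in (0.9, 0.99, 0.999):
    v = discounted_value(beta)
    # The normalized discounted costs should approach rho as beta tends to 1.
    print(beta, [(1 - beta) * v[x] for x in STATES], rho)
```

Relative value iteration is used here only as a convenient reference computation; it is expected to converge for this example because every transition probability is positive, an assumption that does not hold in general.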


“…245-247) is discussed in the online supplement. For average reward, a popular approach, called the vanishing-discount approach (Arapostathis et al, 1993), employs any discounted RL algorithm with a very small positive value for R. Alternatively, one can use R-SMART (Gosavi, 2004a) that differs from Q-Learning (Figure 6) for MDPs as follows. In Step 1, ρ, which denotes the current estimate of the optimal average reward, is set to 0 along with T r and T t , which are also set to 0.…”

Section: Semi-Markov Decision Problems

mentioning

confidence: 99%

Reinforcement Learning: A Tutorial Survey and Recent Advances

Gosavi

2009

INFORMS Journal on Computing

252


In the last few years, Reinforcement Learning (RL), also called adaptive (or approximate) dynamic programming (ADP), has emerged as a powerful tool for solving complex sequential decision-making problems in control theory. Although seminal research in this area was performed in the artificial intelligence (AI) community, more recently, it has attracted the attention of optimization theorists because of several noteworthy success stories from operations management. It is on large-scale and complex problems of dynamic optimization, in particular the Markov decision problem (MDP) and its variants, that the power of RL becomes more obvious. It has been known for many years that on large-scale MDPs, the curse of dimensionality and the curse of modeling render classical dynamic programming (DP) ineffective. The excitement in RL stems from its direct attack on these curses, allowing it to solve problems that were considered intractable, via classical DP, in the past. The success of RL is due to its strong mathematical roots in the principles of DP, Monte Carlo simulation, function approximation, and AI. Topics treated in some detail in this survey are: Temporal differences, Q-Learning, semi-MDPs and stochastic games. Several recent advances in RL, e.g., policy gradients and hierarchical RL, are covered along with references. Pointers to numerous examples of applications are provided. This overview is aimed at uncovering the mathematical roots of this science, so that readers gain a clear understanding of the core concepts and are able to use them in their own research. The survey points to more than 100 references from the literature.


“…For a more substantial introduction, see Puterman's book [33] or the survey paper by Arapostathis et al [1]. We consider an MDP with a countable state set X, a finite action set A, a nonnegative and bounded reward function R such that R : X × A → R + , and a state transition function P that maps the state and action pair to a probability distribution over X.…”

Section: Markov Decision Process

mentioning

confidence: 99%

“…(2), obtaining a function h π and J π ∞ for π. Note that the function h π that satisfies the Poisson's equation with respect to π is not necessarily unique [1,33]. Under Assumption 2.1, the following function known as the "relative value function"…”

Section: Parallel Rollout

mentioning

confidence: 99%


Approximate Receding Horizon Approach for Markov Decision Processes: Average Reward Case

Chang, Marcus

2002

Self Cite


We consider an approximation scheme for solving Markov Decision Processes (MDPs) with countable state space, finite action space, and bounded rewards that uses an approximate solution of a fixed finite-horizon sub-MDP of a given infinite-horizon MDP to create a stationary policy, which we call "approximate receding horizon control". We first analyze the performance of the approximate receding horizon control for infinite-horizon average reward under an ergodicity assumption, which also generalizes the result obtained by White [36]. We then study two examples of the approximate receding horizon control via lower bounds to the exact solution to the sub-MDP. The first control policy is based on a finite-horizon approximation of Howard's policy improvement of a single policy and the second policy is based on a generalization of the single policy improvement for multiple policies. Along the study, we also provide a simple alternative proof on the policy improvement for countable state space. We finally discuss practical implementations of these schemes via simulation.
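The receding horizon idea described in this abstract can be illustrated with a small Python sketch. It is only a simplified, exact version under stated assumptions: the sub-MDP is solved exactly by backward induction rather than approximately, the state space is finite rather than countable, and the transition matrices `P`, rewards `r`, and function names are hypothetical values invented for the example, not taken from the paper.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
# P[a][x][y]: probability of moving from state x to state y under action a.
# r[x][a]: bounded one-step reward for taking action a in state x.
P = [
    [[0.9, 0.1], [0.4, 0.6]],  # action 0
    [[0.2, 0.8], [0.7, 0.3]],  # action 1
]
r = [[1.0, 0.5], [3.0, 3.5]]
STATES = range(2)
ACTIONS = range(2)


def q_value(x, a, v):
    """One-step reward plus expected continuation value v."""
    return r[x][a] + sum(P[a][x][y] * v[y] for y in STATES)


def receding_horizon_policy(horizon):
    """Stationary policy that plays the first action of an optimal H-horizon plan."""
    v = [0.0, 0.0]  # terminal value of the finite-horizon sub-MDP
    for _ in range(horizon - 1):
        v = [max(q_value(x, a, v) for a in ACTIONS) for x in STATES]
    return [max(ACTIONS, key=lambda a: q_value(x, a, v)) for x in STATES]


def average_reward(policy, n_iter=1000):
    """Long-run average reward of a stationary policy via its stationary distribution."""
    mu = [1.0, 0.0]
    for _ in range(n_iter):  # power iteration on the induced Markov chain
        mu = [sum(mu[x] * P[policy[x]][x][y] for x in STATES) for y in STATES]
    return sum(mu[x] * r[x][policy[x]] for x in STATES)


for horizon in (1, 2, 5):
    policy = receding_horizon_policy(horizon)
    print(horizon, policy, average_reward(policy))
```

In this toy model the one-step (myopic) policy and the longer-horizon policies differ, which gives a concrete sense of how the choice of horizon can affect the average reward of the resulting stationary policy.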

