For the reader's convenience, we provide a statement and a self-contained proof of the average cost optimality equation for MDPs. The material in this section is standard.

We let an irreducible MDP M with finite primitives be given. The state space is S, the action set is A, the reward function is r : S × A → R, and the transition function is p(· | s, a).56 We let Σ denote the set of strategies in M. For δ < 1 and N ∈ N, we let
\[
v_\delta(s) \;=\; \max_{\sigma \in \Sigma} \mathbf{E}^{s}_{\sigma}\Bigl[(1-\delta)\sum_{n=0}^{\infty} \delta^{n}\, r(s_n, a_n)\Bigr]
\qquad\text{and}\qquad
v_N(s) \;=\; \max_{\sigma \in \Sigma} \mathbf{E}^{s}_{\sigma}\Bigl[\tfrac{1}{N}\sum_{n=0}^{N-1} r(s_n, a_n)\Bigr]
\]
denote the values of the discounted and the finite-horizon versions of M, as functions of the initial state s.

56 We are thus assuming that the sets S and A are finite, and that for each policy ρ : S → Δ(A), the induced Markov chain (s_n) is irreducible.
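To make the preceding definitions concrete, here is a minimal numerical sketch, assuming Python with NumPy; the two-state MDP below and its reward and transition numbers are hypothetical, chosen only for illustration. It computes the normalized discounted value v_δ by iterating the discounted Bellman operator.

```python
import numpy as np

# Hypothetical irreducible MDP with 2 states and 2 actions (illustration only):
# r[s, a] is the one-step reward, P[s, a, s'] the transition probability.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

def discounted_value(r, P, delta, tol=1e-12):
    """Value iteration for v_delta(s) = max_sigma E[(1 - delta) * sum_n delta^n r(s_n, a_n)]."""
    v = np.zeros(r.shape[0])
    while True:
        # Discounted Bellman operator: best one-step reward plus discounted continuation value.
        v_new = np.max((1 - delta) * r + delta * (P @ v), axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

print(discounted_value(r, P, delta=0.95))
```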
PROPOSITION 6 (ACOE): There is a unique v ∈ R and a unique (up to an additive constant) map θ : S → R such that
\[
v + \theta(s) \;=\; \max_{a \in A}\Bigl\{\, r(s,a) + \sum_{s' \in S} p(s' \mid s, a)\,\theta(s') \Bigr\} \qquad \text{for every } s \in S.
\]
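As a sanity check on the proposition, a standard way to compute the pair (v, θ) numerically is relative value iteration. The sketch below is again a hypothetical Python/NumPy illustration on the same kind of toy MDP, not part of the proof: it iterates the undiscounted Bellman operator, renormalizes so that θ vanishes at a reference state, and verifies that the resulting pair satisfies the optimality equation.

```python
import numpy as np

# Same hypothetical 2-state, 2-action toy MDP as in the previous sketch.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

def solve_acoe(r, P, tol=1e-12, max_iter=1_000_000):
    """Relative value iteration: find (v, theta) with
    v + theta(s) = max_a { r(s, a) + sum_s' p(s' | s, a) theta(s') }."""
    theta = np.zeros(r.shape[0])
    for _ in range(max_iter):
        T = np.max(r + P @ theta, axis=1)   # undiscounted Bellman update
        v, theta_new = T[0], T - T[0]       # normalize so that theta(s_0) = 0
        if np.max(np.abs(theta_new - theta)) < tol:
            return v, theta_new
        theta = theta_new
    return v, theta

v, theta = solve_acoe(r, P)
residual = v + theta - np.max(r + P @ theta, axis=1)
print(v, theta, np.max(np.abs(residual)))   # residual should be ~0 in every state
```

The renormalization step pins down the additive constant in θ (here by setting θ(s_0) = 0), matching the uniqueness statement in the proposition.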