Discounted deterministic Markov decision processes and discounted all-pairs shortest paths

Madani, Omid; Thorup, Mikkel; Zwick, Uri

doi:10.1145/1721837.1721849

Cited by 16 publications

(29 citation statements)

References 27 publications

(28 reference statements)


“…The fastest known algorithm for uniformly discounted deterministic MDPs runs in time O(mn) [MTZ10]. However, these problems were not known to be solvable in polynomial time with the more-generic simplex method.…”

Section: Introduction (mentioning)

confidence: 99%

The simplex method is strongly polynomial for deterministic Markov decision processes

Post, Ye

2013

Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms


We prove that the simplex method with the highest gain/most-negative-reduced cost pivoting rule converges in strongly polynomial time for deterministic Markov decision processes (MDPs) regardless of the discount factor. For a deterministic MDP with n states and m actions, we prove the simplex method runs in O(n^3 m^2 log^2 n) iterations if the discount factor is uniform and O(n^5 m^3 log^2 n) iterations if each action has a distinct discount factor. Previously the simplex method was known to run in polynomial time only for discounted MDPs where the discount was bounded away from 1 [Ye11]. Unlike in the discounted case, the algorithm does not greedily converge to the optimum, and we require a more complex measure of progress. We identify a set of layers in which the values of primal variables must lie and show that the simplex method always makes progress optimizing one layer, and when the upper layer is updated the algorithm makes a substantial amount of progress. In the case of nonuniform discounts, we define a polynomial number of "milestone" policies and we prove that, while the objective function may not improve substantially overall, the value of at least one dual variable is always making progress towards some milestone, and the algorithm will reach the next milestone in a polynomial number of steps.
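To make the object of these statements concrete: a deterministic MDP with n states and m actions can be viewed as a directed graph with n vertices and m reward-weighted edges, and with a uniform discount factor the value of a state is the best discounted sum of rewards along any infinite walk starting there. The sketch below computes these values by plain value iteration. It is an illustration only, under stated assumptions: it is neither the O(mn) algorithm of [MTZ10] nor the simplex method analysed above, and the names `dmdp_value_iteration`, `edges`, `gamma`, and `tol` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical encoding: each action is an edge (source, target, reward).
Edge = Tuple[int, int, float]


def dmdp_value_iteration(
    n: int,
    edges: List[Edge],
    gamma: float,
    tol: float = 1e-10,
    max_iters: int = 100_000,
) -> Tuple[List[float], List[int]]:
    """Approximate optimal values of a uniformly discounted deterministic MDP.

    Plain value iteration (not the O(mn) method of Madani et al.).
    Returns (values, policy), where policy[u] is the index of the chosen edge.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if {u for u, _, _ in edges} != set(range(n)):
        raise ValueError("every state needs at least one outgoing action")

    values = [0.0] * n
    for _ in range(max_iters):
        # Bellman update: best one-step reward plus discounted successor value.
        new_values = [float("-inf")] * n
        for u, v, r in edges:
            new_values[u] = max(new_values[u], r + gamma * values[v])
        delta = max(abs(a - b) for a, b in zip(new_values, values))
        values = new_values
        if delta < tol:
            break

    # Extract a greedy policy with respect to the final values.
    best = [float("-inf")] * n
    policy = [-1] * n
    for idx, (u, v, r) in enumerate(edges):
        candidate = r + gamma * values[v]
        if candidate > best[u]:
            best[u] = candidate
            policy[u] = idx
    return values, policy


# Two states: staying in state 1 pays 3 per step, staying in state 0 pays 1.
example_edges = [(0, 0, 1.0), (0, 1, 0.0), (1, 1, 3.0), (1, 0, 0.0)]
example_values, example_policy = dmdp_value_iteration(2, example_edges, gamma=0.5)
```

In this toy instance the optimal choice from state 0 is to move to state 1 and stay there. Value iteration needs a number of sweeps that grows as the discount factor approaches 1, which is the dependence that strongly polynomial bounds avoid.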


“…When the IC paths are found, the lowest payoffs for each player in Step 2 can be found in O(mn) time (Madani et al. 2010; Papadimitriou and Tsitsiklis 1987), where n is the number of nodes and m is the number of edges in the finite graph of IC paths. The task is essentially the same as finding the optimal strategies for discounted, infinite-horizon, deterministic Markov decision processes (DMDPs).…”

Section: Methods (mentioning)

confidence: 99%

Extremal Pure Strategies and Monotonicity in Repeated Games

Berg

2016

Comput Econ


The recent development of computational methods in repeated games has made it possible to study the properties of subgame-perfect equilibria in more detail. This paper shows that the lowest equilibrium payoffs may increase in pure strategies when the players become more patient and this may cause the set of equilibrium paths to be non-monotonic. A numerical example is constructed such that a path is no longer equilibrium when the players' discount factors increase. This property can be more easily seen when the players have different time preferences, since in these games the punishment strategies may rely on the differences between the players' discount factors. A sufficient condition for the monotonicity of equilibrium paths is that the lowest equilibrium payoffs do not increase, i.e., the punishments should not become milder.


“…If we look beyond PI, we find even subexponential bounds on the expected running time of MDP planning. Bounds of the form poly(n, k) · exp(O(√(n log(n)))) [Matoušek et al, 1996] follow directly from posing MDP planning as a linear program with n variables and nk constraints [Littman et al, 1995]. The special structure that results when k = 2 admits an even tighter bound of poly(n) · exp(2√n) [Gärtner, 2002].…”

Section: Related Work and Contribution (mentioning)

confidence: 99%

“…Alternatively, if we fix n, the linear programming route can yield strong worst-case bounds that are linear in k: for example, [Megiddo, 1984] and n^O(n) · k [Chazelle and Matousek, 1996]. It must also be noted that for deterministic MDPs, strong worst-case bounds of the form poly(n, k) are indeed possible [Madani et al, 2010; Post and Ye, 2013].…”

Section: Related Work and Contribution (mentioning)

confidence: 99%

Improved Strong Worst-case Upper Bounds for MDP Planning

Gupta

Kalyanakrishnan

2017

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence


The Markov Decision Problem (MDP) plays a central role in AI as an abstraction of sequential decision making. We contribute to the theoretical analysis of MDP planning, which is the problem of computing an optimal policy for a given MDP. Specifically, we furnish improved strong worst-case upper bounds on the running time of MDP planning. Strong bounds are those that depend only on the number of states n and the number of actions k in the specified MDP; they have no dependence on affiliated variables such as the discount factor and the number of bits needed to represent the MDP. Worst-case bounds apply to every run of an algorithm; randomised algorithms can typically yield faster expected running times. While the special case of 2-action MDPs (that is, k = 2) has recently received some attention, bounds for general k have remained to be improved for several decades. Our contributions are to this general case. For k ≥ 3, the tightest strong upper bound shown to date for MDP planning belongs to a family of algorithms called Policy Iteration. This bound is only a polynomial improvement over a trivial bound of poly(n, k) · k^n [Mansour and Singh, 1999]. In this paper, we generalise a contrasting algorithm called the Fibonacci Seesaw, and derive a bound of poly(n, k) · k^(0.6834n). The key construct that we use is a template to map algorithms for the 2-action setting to the general setting. Interestingly, this idea can also be used to design Policy Iteration algorithms with a running time upper bound of poly(n, k) · k^(0.7207n). Both our results improve upon bounds that have stood for several decades.

