2012
DOI: 10.1007/978-3-642-34106-9_26

PAC Bounds for Discounted MDPs

Abstract: We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being…
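
For orientation, here is the shape of the upper bound the abstract describes: cubic in the effective horizon 1/(1 − γ) and linear in the number N of non-zero transition probabilities. This is a minimal sketch assuming the standard (ε, δ)-PAC setting; log factors and constants are suppressed here, so consult the paper for the exact statement.

```latex
% Shape of the UCRL-style upper bound described in the abstract.
% N        : number of non-zero transition probabilities
% 1/(1-γ)  : effective horizon of the discounted MDP
% ε        : accuracy parameter of the PAC guarantee
% Log factors and constants are omitted (assumption; see the paper).
\[
  \text{sample complexity} \;\in\;
  \tilde{O}\!\left(\frac{N}{\varepsilon^{2}\,(1-\gamma)^{3}}\right)
\]
```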

Cited by 121 publications (105 citation statements, of which 103 are classified as "mentioning"; citing years 2013–2023). References 8 publications. The citation statements below are ordered by relevance.
“…In the case that more than two states are accessible from every state-action pair, the result of Lattimore and Hutter (2012a) translates to an upper bound of O(|X|²|A|/(ε²(1 − γ)³)), which has a quadratic dependency on the size of the state space |X|, whereas our bounds, at least for small values of ε, scale linearly with |X| (see also Lattimore and Hutter 2012b).…”
mentioning
confidence: 91%
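
To make the scaling comparison in this statement concrete, the toy Python sketch below evaluates the two bound shapes side by side. The constants, and the ε, γ, and |A| values, are arbitrary illustrations, not figures from either paper.

```python
# Toy comparison of how the two sample-complexity shapes scale with |X|.
# Constants and log factors are omitted; the numbers are illustrative only.

def bound_quadratic(num_states: int, num_actions: int, eps: float, gamma: float) -> float:
    """O(|X|^2 |A| / (eps^2 (1-gamma)^3)) -- the translated Lattimore-Hutter form."""
    return num_states**2 * num_actions / (eps**2 * (1 - gamma) ** 3)

def bound_linear(num_states: int, num_actions: int, eps: float, gamma: float) -> float:
    """O(|X| |A| / (eps^2 (1-gamma)^3)) -- the linear-in-|X| shape of the citing work."""
    return num_states * num_actions / (eps**2 * (1 - gamma) ** 3)

for n in (10, 100, 1000):
    q = bound_quadratic(n, num_actions=5, eps=0.1, gamma=0.9)
    l = bound_linear(n, num_actions=5, eps=0.1, gamma=0.9)
    print(f"|X|={n:5d}  quadratic={q:.3e}  linear={l:.3e}  ratio={q / l:.0f}")
```

As expected, the ratio between the two grows linearly with |X|, which is exactly the gap the citing authors point to.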
“…There is a large body of work on sampling methods for MDPs in the reinforcement learning literature; see, e.g., [26, 25, 17, 1] and many others. These works studied learning algorithms that update parameters by drawing information from some oracle, where the sampling oracles and modeling assumptions vary.…”
Section: Previous Work
mentioning
confidence: 99%
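
As a concrete picture of the oracle access these works assume, here is a minimal generative-model sampling oracle for a finite MDP. The class name, the tabular encoding, and the example MDP are hypothetical illustrations, not an API from any of the cited papers.

```python
import random

class MDPOracle:
    """A generative model: sample a next state and reward for any (state, action)."""

    def __init__(self, transitions, rewards):
        # transitions[(s, a)] = list of (next_state, probability) pairs
        self.transitions = transitions
        # rewards[(s, a)] = immediate reward (deterministic here for simplicity)
        self.rewards = rewards

    def sample(self, state, action):
        """Draw one next state according to P(.|state, action), plus the reward."""
        next_states, probs = zip(*self.transitions[(state, action)])
        next_state = random.choices(next_states, weights=probs, k=1)[0]
        return next_state, self.rewards[(state, action)]

# Usage: a learner calls oracle.sample(s, a) repeatedly to estimate the model.
oracle = MDPOracle(
    transitions={(0, 0): [(0, 0.5), (1, 0.5)], (0, 1): [(1, 1.0)],
                 (1, 0): [(0, 1.0)], (1, 1): [(1, 1.0)]},
    rewards={(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.5},
)
print(oracle.sample(0, 0))
```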
“…2. To obtain an additional factor of the horizon we proceed in the same fashion as the lower bound given by Lattimore and Hutter (2012). Adapt the environment again so that the agent stays in the decision node for exactly O(1/(1 − γ)) time-steps, regardless of its action.…”
Section: Lower Bound on Sample-Complexity
mentioning
confidence: 99%
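
A one-line version of the counting step behind this construction, as a sketch: if every visit to the decision node is stretched to Θ(1/(1 − γ)) time-steps regardless of the action taken, then any lower bound on the number of visits translates into a time-step bound with one extra horizon factor.

```latex
% Sketch of the extra horizon factor (assumption: the PAC count is over
% time-steps, and each decision-node visit costs Θ(1/(1-γ)) time-steps).
\[
  \#\text{time-steps}
  \;=\; \#\text{visits} \times \Theta\!\left(\frac{1}{1-\gamma}\right)
  \;\Longrightarrow\;
  \Omega(B) \text{ visits yields }
  \Omega\!\left(\frac{B}{1-\gamma}\right) \text{ time-steps.}
\]
```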
“…Aside from the previously mentioned papers, there has been little work on this problem, although sample-complexity bounds have been proven for MDPs (Lattimore and Hutter, 2012; Szita and Szepesvári, 2010; Kearns and Singh, 2002, and references therein), as well as for partially observable and factored MDPs (Chakraborty and Stone, 2011; Even-Dar et al., 2005). There is also a significant literature on the regret criterion for MDPs (Azar et al., 2013; Auer et al., 2010, and references therein), but meaningful results cannot be obtained without a connectedness assumption that we avoid here.…”
Section: Introduction
mentioning
confidence: 99%