2012
DOI: 10.1007/978-3-642-34106-9_26

PAC Bounds for Discounted MDPs

Abstract: We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being…
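
For orientation, here is the shape of the upper bound the abstract describes: cubic in the effective horizon 1/(1 − γ) and linear in the number N of non-zero transition probabilities. This is a minimal sketch assuming the standard (ε, δ)-PAC setting; log factors and constants are suppressed here, so consult the paper for the exact statement.

```latex
% Shape of the UCRL-style upper bound described in the abstract.
% N        : number of non-zero transition probabilities
% 1/(1-γ)  : effective horizon of the discounted MDP
% ε        : accuracy parameter of the PAC guarantee
% Log factors and constants are omitted (assumption; see the paper).
\[
  \text{sample complexity} \;\in\;
  \tilde{O}\!\left(\frac{N}{\varepsilon^{2}\,(1-\gamma)^{3}}\right)
\]
```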

Cited by 121 publications (105 citation statements, of which 103 are classified as "mentioning"; citing years 2013–2023). References 8 publications. The citation statements below are ordered by relevance.
“…In the case that more than two states are accessible from every state-action pair, the result of Lattimore and Hutter (2012a) translates to an upper bound of O(|X|²|A|/(ε²(1 − γ)³)), which has a quadratic dependency on the size of the state space |X|, whereas our bounds, at least for small values of ε, scale linearly with |X| (see also Lattimore and Hutter 2012b).…”
mentioning
confidence: 91%
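
To make the scaling comparison in this statement concrete, the toy Python sketch below evaluates the two bound shapes side by side. The constants, and the ε, γ, and |A| values, are arbitrary illustrations, not figures from either paper.

```python
# Toy comparison of how the two sample-complexity shapes scale with |X|.
# Constants and log factors are omitted; the numbers are illustrative only.

def bound_quadratic(num_states: int, num_actions: int, eps: float, gamma: float) -> float:
    """O(|X|^2 |A| / (eps^2 (1-gamma)^3)) -- the translated Lattimore-Hutter form."""
    return num_states**2 * num_actions / (eps**2 * (1 - gamma) ** 3)

def bound_linear(num_states: int, num_actions: int, eps: float, gamma: float) -> float:
    """O(|X| |A| / (eps^2 (1-gamma)^3)) -- the linear-in-|X| shape of the citing work."""
    return num_states * num_actions / (eps**2 * (1 - gamma) ** 3)

for n in (10, 100, 1000):
    q = bound_quadratic(n, num_actions=5, eps=0.1, gamma=0.9)
    l = bound_linear(n, num_actions=5, eps=0.1, gamma=0.9)
    print(f"|X|={n:5d}  quadratic={q:.3e}  linear={l:.3e}  ratio={q / l:.0f}")
```

As expected, the ratio between the two grows linearly with |X|, which is exactly the gap the citing authors point to.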
“…There is a large body of work on sampling methods for MDPs in the reinforcement learning literature; see, e.g., [26, 25, 17, 1] and many others. These works studied learning algorithms that update parameters by drawing information from some oracle, where the sampling oracles and modeling assumptions vary.…”
Section: Previous Work
mentioning
confidence: 99%
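
As a concrete picture of the oracle access these works assume, here is a minimal generative-model sampling oracle for a finite MDP. The class name, the tabular encoding, and the example MDP are hypothetical illustrations, not an API from any of the cited papers.

```python
import random

class MDPOracle:
    """A generative model: sample a next state and reward for any (state, action)."""

    def __init__(self, transitions, rewards):
        # transitions[(s, a)] = list of (next_state, probability) pairs
        self.transitions = transitions
        # rewards[(s, a)] = immediate reward (deterministic here for simplicity)
        self.rewards = rewards

    def sample(self, state, action):
        """Draw one next state according to P(.|state, action), plus the reward."""
        next_states, probs = zip(*self.transitions[(state, action)])
        next_state = random.choices(next_states, weights=probs, k=1)[0]
        return next_state, self.rewards[(state, action)]

# Usage: a learner calls oracle.sample(s, a) repeatedly to estimate the model.
oracle = MDPOracle(
    transitions={(0, 0): [(0, 0.5), (1, 0.5)], (0, 1): [(1, 1.0)],
                 (1, 0): [(0, 1.0)], (1, 1): [(1, 1.0)]},
    rewards={(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.5},
)
print(oracle.sample(0, 0))
```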
“…2. To obtain an additional factor of the horizon we proceed in the same fashion as the lower bound given by Lattimore and Hutter (2012). Adapt the environment again so that the agent stays in the decision node for exactly O(1/(1 − γ)) time-steps, regardless of its action.…”
Section: Lower Bound on Sample-Complexity
mentioning
confidence: 99%
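
A one-line version of the counting step behind this construction, as a sketch: if every visit to the decision node is stretched to Θ(1/(1 − γ)) time-steps regardless of the action taken, then any lower bound on the number of visits translates into a time-step bound with one extra horizon factor.

```latex
% Sketch of the extra horizon factor (assumption: the PAC count is over
% time-steps, and each decision-node visit costs Θ(1/(1-γ)) time-steps).
\[
  \#\text{time-steps}
  \;=\; \#\text{visits} \times \Theta\!\left(\frac{1}{1-\gamma}\right)
  \;\Longrightarrow\;
  \Omega(B) \text{ visits yields }
  \Omega\!\left(\frac{B}{1-\gamma}\right) \text{ time-steps.}
\]
```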
“…Aside from the previously mentioned papers, there has been little work on this problem, although sample-complexity bounds have been proven for MDPs (Lattimore and Hutter, 2012; Szita and Szepesvári, 2010; Kearns and Singh, 2002, and references therein), as well as for partially observable and factored MDPs (Chakraborty and Stone, 2011; Even-Dar et al., 2005). There is also a significant literature on the regret criterion for MDPs (Azar et al., 2013; Auer et al., 2010, and references therein), but meaningful results cannot be obtained without a connectedness assumption that we avoid here.…”
Section: Introduction
mentioning
confidence: 99%