2014
DOI: 10.1016/j.tcs.2014.09.029
Near-optimal PAC bounds for discounted MDPs

Abstract: We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being…

Cited by 34 publications
(40 citation statements)
References 6 publications
“…Our lower bound instance is simplified from the instances in Azar et al [2013], Lattimore and Hutter [2014], Pananjady and Wainwright [2020].…”
Section: C1 Lower Bound of Offline Evaluation
confidence: 99%
“…Theorem IV.6. (Theorem 1, [110] and Theorem 11, [111]) In both the generative and online sampling models, for ε and δ small enough, there exists an MDP for which learning requires a sample complexity in Ω( n_x n_u / (ε² (1−γ)³) · log(n_x/δ) ) (where γ denotes the discount factor).…”
Section: B Discounted MDPs
confidence: 99%
“…2) The price of model-free approaches: Some model-based algorithms are known to match the minimax sample complexity lower bound. In the online sampling setting, the authors of [111] present UCRL(γ), an extension of UCRL to discounted costs, and establish a minimax sample complexity upper bound matching the above lower bound. UCRL(γ) derives upper confidence bounds for the MDP parameters and selects actions optimistically (which can lead to significant computational issues).…”
Section: B Discounted MDPs
confidence: 99%
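The optimistic mechanism described in the quote (confidence bounds on the estimated MDP parameters, then greedy action selection against the optimistic values) can be sketched in a few lines. This is a minimal illustration, not the UCRL(γ) algorithm from [111]: the function name `optimistic_q_values`, the Hoeffding-style bonus, and the truncated value-iteration backup are assumptions made for the example.

```python
import numpy as np

def optimistic_q_values(counts, rewards, gamma, delta, n_iter=500):
    """Optimistic planning sketch: empirical model plus a confidence
    bonus, followed by value iteration on the optimistic Q-values.

    counts[s, a, s2] : observed transition counts
    rewards[s, a]    : rewards in [0, 1], assumed known here
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)                         # visits to each (s, a)
    p_hat = counts / np.maximum(n_sa, 1)[..., None]   # empirical transition model
    vmax = 1.0 / (1.0 - gamma)                        # value range, scales the bonus
    # Hoeffding-style bonus; unvisited pairs get the maximum possible bonus.
    bonus = vmax * np.sqrt(np.log(2 * S * A / delta) / (2 * np.maximum(n_sa, 1)))
    bonus[n_sa == 0] = vmax
    q = np.zeros((S, A))
    for _ in range(n_iter):
        v = q.max(axis=1)                             # greedy (optimistic) values
        q = np.minimum(rewards + bonus + gamma * p_hat @ v, vmax)
    return q
```

Unvisited state-action pairs keep the maximum bonus 1/(1−γ), which is what drives exploration toward poorly-understood parts of the model and is also the source of the computational burden the quote alludes to: the confidence region must be re-planned over as data accumulates.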
“…In particular, Jaksch et al. (2010) provided a regret lower bound Ω(√(HSAT)) for H-horizon MDPs. There is also a line of work studying the sample complexity of obtaining a value or policy that is at most ε-suboptimal (Kakade, 2003; Strehl et al., 2006, 2009; Szita and Szepesvári, 2010; Lattimore and Hutter, 2014; Azar et al., 2013; Dann and Brunskill, 2015; Sidford et al., 2018). The optimal sample complexity for finding an ε-optimal policy is O( |S||A| (1−γ)⁻² ε⁻² ) (Sidford et al., 2018) for a discounted MDP with discount factor γ.…”
Section: Related Literature
confidence: 99%