2020
DOI: 10.48550/arxiv.2005.00527
Preprint

Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

Abstract: Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is ε-close to the optimal value, where value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem […]
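
To make the normalized notion of value concrete, the following short LaTeX sketch spells out what an ε-near-optimal policy means here; the symbols V^π, V^*, K, and the reward range [0, 1] are assumptions for illustration and are not quoted from the abstract.

% Normalized value of a policy \pi in an episodic MDP with horizon H,
% assuming per-step rewards r_h \in [0, 1]:
\[
  V^{\pi} \;=\; \mathbb{E}_{\pi}\!\left[\frac{1}{H}\sum_{h=1}^{H} r_h\right] \in [0, 1],
  \qquad
  V^{*} \;=\; \max_{\pi} V^{\pi}.
\]
% Sample complexity question: how many episodes K are needed before the learner
% can output a policy \hat{\pi} that is provably \varepsilon-near-optimal, i.e.
\[
  V^{*} - V^{\hat{\pi}} \;\le\; \varepsilon,
\]
% and in particular how K must scale with the horizon H.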

Cited by 16 publications (29 citation statements)
References 10 publications (8 reference statements)

“…The result in [WDYK20] shows the surprising fact that one can achieve similar sample efficiency regardless of the horizon length of the RL problem. A number of recent results [ZJD20, MDSV21, ZDJ20, RLD+21] further improve on [WDYK20], obtaining more (computationally or statistically) efficient algorithms and/or handling more general settings. However, these results all retain a poly((log H)/ε) factor in their sample complexity, seemingly implying that the polylog(H) dependence is necessary.…”
Section: Introduction (mentioning)
confidence: 92%
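
In LaTeX, the contrast drawn in this passage can be summarized as below; the exact shape of the earlier, horizon-dependent bounds is an illustrative assumption rather than a quote from the citing paper.

\begin{align*}
  &\text{earlier worst-case bounds:}
    && K \;=\; \mathrm{poly}\!\left(|S|,\,|A|,\,H,\,\tfrac{1}{\varepsilon}\right),\\
  &\text{[WDYK20] and follow-ups:}
    && K \;=\; \mathrm{poly}\!\left(|S|,\,|A|,\,\tfrac{1}{\varepsilon}\right)\cdot \mathrm{polylog}(H),\\
  &\text{question left open:}
    && \text{can the } \mathrm{polylog}(H) \text{ factor be removed entirely?}
\end{align*}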
“…Under […], where the second term still scales polynomially with H. Wang et al [WDYK20] show that it is possible to obtain an ε-optimal policy with a sample complexity of poly(|S||A|/ε) · log^3 H, establishing the first sample complexity with polylog(H) dependence on the horizon length H. They achieve this result by using the following ideas: (1) samples collected by different policies can be reused to evaluate other policies; (2) to evaluate all policies in a finite set Π, the number of sample episodes required is at most poly(|S||A|/ε) · log |Π| · log^2 H; (3) a set of policies Π containing at least one ε-optimal policy for any MDP can be constructed by taking ε-nets over the reward values and the transition probabilities. Here Π contains all optimal non-stationary policies for each MDP in the ε-net.…”
Section: Related Work (mentioning)
confidence: 99%
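
A short LaTeX sketch of how the three quoted ingredients combine into the log^3 H bound; the net resolution poly(ε/H) and the resulting count of MDPs in the net are assumptions made for illustration and are not stated in the quoted passage.

% (2) Evaluating every policy in a finite set \Pi, as quoted:
\[
  K \;\lesssim\; \mathrm{poly}\!\left(\tfrac{|S||A|}{\varepsilon}\right)\cdot \log|\Pi| \cdot \log^{2} H .
\]
% (3) Assuming the \varepsilon-net over rewards and transitions is taken at
% resolution \mathrm{poly}(\varepsilon/H), it has at most
% (H/\varepsilon)^{\mathrm{poly}(|S||A|)} elements, and \Pi collects one optimal
% non-stationary policy per MDP in the net, so
\[
  \log|\Pi| \;\le\; \mathrm{poly}(|S||A|)\cdot \log\!\frac{H}{\varepsilon}.
\]
% Substituting back and absorbing the \log(1/\varepsilon) factor into
% \mathrm{poly}(1/\varepsilon) recovers the quoted bound:
\[
  K \;\lesssim\; \mathrm{poly}\!\left(\tfrac{|S||A|}{\varepsilon}\right)\cdot \log^{3} H .
\]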
“…Worst-Case Regret Bounds in Tabular RL. A significant amount of work has been devoted to obtaining worst-case optimal bounds in the setting of tabular RL (Kearns & Singh, 2002; Kakade, 2003; Azar et al., 2017; Dann et al., 2017; Jin et al., 2018; Dann et al., 2019; Wang et al., 2020; Zhang et al., 2020a,b). These approaches fall into both the model-based category (Azar et al., 2017; Dann et al., 2017) and the model-free category (Jin et al., 2018).…”
Section: Related Work (mentioning)
confidence: 99%