2020
DOI: 10.48550/arxiv.2003.07337
Preprint

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Abstract: We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the ℓ∞-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting,…
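To make the setting concrete, here is a minimal sketch of TD(0) policy evaluation under a generative model. It is not the paper's algorithm or analysis; the function names, reward representation, and constant step size are illustrative assumptions.

```python
# Minimal sketch of TD(0) policy evaluation under a generative model.
# All names and constants (sample_next_state, reward, step_size, ...) are
# illustrative assumptions, not taken from the paper.
import numpy as np


def td0_policy_evaluation(sample_next_state, reward, num_states,
                          discount=0.9, num_iters=50_000, step_size=0.1,
                          seed=0):
    """Estimate the value function of a fixed policy by TD(0).

    sample_next_state(s) draws s' ~ P(. | s) under the evaluated policy,
    which is what a generative model provides; reward[s] is the expected
    one-step reward at state s.
    """
    rng = np.random.default_rng(seed)
    value = np.zeros(num_states)
    for _ in range(num_iters):
        s = int(rng.integers(num_states))   # generative model: query any state
        s_next = sample_next_state(s)
        td_error = reward[s] + discount * value[s_next] - value[s]
        value[s] += step_size * td_error    # stochastic-approximation update
    return value


if __name__ == "__main__":
    # Tiny random 5-state example, purely for illustration.
    rng = np.random.default_rng(1)
    P = rng.dirichlet(np.ones(5), size=5)   # transition matrix under the policy
    r = rng.random(5)
    V = td0_policy_evaluation(lambda s: int(rng.choice(5, p=P[s])), r, 5)
    print(V)
```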

Cited by 8 publications (13 citation statements)
References 24 publications
“…A number of impactful analysis techniques have been developed for this setting with corresponding minimax bounds (Azar et al., 2013; Sidford et al., 2018; Agarwal et al., 2020; Li et al., 2020). Recently, several instance-dependent results have been shown in the generative model setting (Khamaru et al., 2020, 2021). Most relevant is the work of which proposes the Bespoke algorithm and achieves a sample complexity of ∑_{s,a} log(1/δ) / max{ε², ∆(s,a)²}, ignoring horizon dependence.…”
Section: Related Work (mentioning)
confidence: 99%
“…Establishing information-theoretic or algorithm-specific lower bounds on the statistical and computational complexities of RL algorithms, often achieved by constructing hard MDP instances, plays an instrumental role in understanding the bottlenecks of RL algorithms. To give a few examples, Azar et al. (2013) established an information-theoretic lower bound on the sample complexity of learning the optimal policy in a generative model, whereas Khamaru et al. (2020) and Pananjady and Wainwright (2020) developed instance-dependent lower bounds for policy evaluation. Additionally, Agarwal et al. (2019) constructed a chain-like MDP whose value function under direct parameterization might contain very flat saddle points under a certain initial state distribution, highlighting the role of distribution mismatch coefficients in policy optimization.…”
Section: Other Related Work (mentioning)
confidence: 99%
“…The seminal idea of variance reduction was originally proposed to accelerate finite-sum stochastic optimization, e.g., Gower et al. (2020); Johnson and Zhang (2013); Nguyen et al. (2017). The variance-reduction strategy has since been imported into RL, where it helps improve the sample efficiency of RL algorithms in multiple contexts, including but not limited to policy evaluation (Du et al., 2017; Khamaru et al., 2020; Wai et al., 2019; Xu et al., 2019), RL with a generative model (Sidford et al., 2018a,b; Wainwright, 2019b), asynchronous Q-learning (Li et al., 2020b), and offline RL (Yin et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%
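As a rough illustration of the variance-reduction idea referenced in the statement above, the sketch below recentres the TD update around a Bellman estimate computed at an anchor point at the start of each epoch. Epoch lengths, recentring sample counts, the step size, and all function names are hypothetical choices; this is a sketch of generic SVRG-style recentring, not the algorithm of any one cited paper.

```python
# Hedged sketch of a variance-reduced (recentred) TD-style update.
# Interface matches the earlier TD(0) sketch; all constants are illustrative.
import numpy as np


def variance_reduced_td(sample_next_state, reward, num_states,
                        discount=0.9, num_epochs=20, epoch_len=2_000,
                        recentre_samples=5_000, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    value = np.zeros(num_states)
    for _ in range(num_epochs):
        anchor = value.copy()
        # Monte Carlo estimate of the Bellman operator at the anchor point,
        # used to recentre (and hence de-noise) the per-step updates.
        bellman_at_anchor = np.zeros(num_states)
        for s in range(num_states):
            draws = [sample_next_state(s) for _ in range(recentre_samples)]
            bellman_at_anchor[s] = reward[s] + discount * np.mean(anchor[draws])
        for _ in range(epoch_len):
            s = int(rng.integers(num_states))
            s_next = sample_next_state(s)
            # Recentred update: noisy Bellman draw minus its value at the
            # anchor, plus the accurate Bellman estimate at the anchor.
            noisy = reward[s] + discount * value[s_next]
            noisy_anchor = reward[s] + discount * anchor[s_next]
            value[s] += step_size * (noisy - noisy_anchor
                                     + bellman_at_anchor[s] - value[s])
    return value
```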