2021
DOI: 10.1287/opre.2020.2024

A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation

Abstract: Temporal difference learning (TD) is a simple iterative algorithm widely used for policy evaluation in Markov reward processes. Bhandari et al. prove finite-time convergence rates for TD learning with linear function approximation. The analysis follows from a key insight that establishes rigorous connections between TD updates and those of online gradient descent. In a model where observations are corrupted by i.i.d. noise, convergence results for TD follow by essentially mirroring the analysis for online gradient descent…
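To make the algorithm in the abstract concrete, below is a minimal sketch of TD(0) with linear function approximation. The `env` interface (a reset/step loop with no action argument, since transitions come from a fixed policy) and the feature map `phi` are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def td0_linear(env, phi, num_features, gamma=0.99, alpha=0.05, num_steps=10_000):
    """Estimate V(s) ~ phi(s) @ theta for the fixed policy generating the transitions.

    `env.reset()` returns an initial state; `env.step()` returns (next_state, reward, done).
    Both, and the feature map `phi`, are assumed interfaces for this sketch.
    """
    theta = np.zeros(num_features)
    s = env.reset()
    for _ in range(num_steps):
        s_next, r, done = env.step()
        v_next = 0.0 if done else phi(s_next) @ theta
        delta = r + gamma * v_next - phi(s) @ theta   # TD error
        theta += alpha * delta * phi(s)               # semi-gradient update
        s = env.reset() if done else s_next
    return theta
```

The update theta <- theta + alpha * delta * phi(s) has the same algebraic form as an online gradient step on a per-sample quadratic objective, which is the structural connection the abstract refers to.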

Cited by 114 publications (303 citation statements)
References 27 publications

“…provided that an appropriate constant learning rate is adopted. We note that prior finite-sample analysis on asynchronous TD learning typically focused on (weighted) ℓ2 estimation errors with linear function approximation [21], [22], and it is hence difficult to make fair comparisons. The recent paper [23] developed ℓ∞ guarantees for TD learning, focusing on the synchronous settings with i.i.d.…”
Section: A Special Case: TD Learning (mentioning)
confidence: 99%
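For context, the two error metrics contrasted in this quote are typically defined as below. This is a standard sketch with assumed notation (μ the stationary distribution, V^π the true value function, V_θ the linear approximation), not taken from the cited works.

```latex
% weighted l2 error (weights given by the stationary distribution \mu)
\| V_\theta - V^\pi \|_{\mu}^2 \;=\; \sum_{s} \mu(s)\,\bigl(V_\theta(s) - V^\pi(s)\bigr)^2,
\qquad
% worst-case (l-infinity) error over all states
\| V_\theta - V^\pi \|_{\infty} \;=\; \max_{s} \bigl| V_\theta(s) - V^\pi(s) \bigr| .
```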
“…The Q-learning algorithm, originally proposed in [29], has been analyzed in the asymptotic regime by [6], [7], [14], [30] since more than two decades ago. Additionally, finite-time performance of Q-learning and its variants have been analyzed by [2], [8]- [10], [19], [31]- [34] in the tabular setting, by [21], [35]- [43] in the context of function approximations, and by [44] with nonparametric regression. In addition, [11], [12], [24], [45]- [47] studied modified Q-learning algorithms that might potentially improve sample complexities and accelerate convergence.…”
Section: A. The Q-Learning Algorithm and Its Variants (mentioning)
confidence: 99%
“…As has been discussed in, e.g., Reference [43], the non-asymptotic rate of convergence (the rate at which the "mean-path" of the TD or Q-learning algorithm converges) is exponential. The asymptotic rate of convergence can be analyzed by deriving an asymptotic stochastic differential equation, which shows a direct dependence of the asymptotic covariance of the value function estimate on the network connectivity [12].…”
Section: Consensus-Based Distributed Joint Spectrum Sensing and Selection (mentioning)
confidence: 99%
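As a rough illustration of the "mean-path" point (a standard sketch under assumed notation, not taken from the cited Reference [43]): averaging out the noise in the linear-function-approximation TD update leaves a deterministic recursion that contracts geometrically for a sufficiently small constant stepsize α when the matrix A is positive definite.

```latex
\bar{\theta}_{t+1} \;=\; \bar{\theta}_t + \alpha\bigl(b - A\,\bar{\theta}_t\bigr)
\quad\Longrightarrow\quad
\bar{\theta}_t - \theta^{\ast} \;=\; (I - \alpha A)^{t}\bigl(\bar{\theta}_0 - \theta^{\ast}\bigr),
\qquad \theta^{\ast} = A^{-1} b ,
```

so the distance to the fixed point shrinks at a geometric (exponential) rate whenever the eigenvalues of I - αA lie strictly inside the unit circle.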
“…Error bounds for constant stepsize synchronous and asynchronous Q-learning algorithms were studied in [6] by combining the union bound and triangle inequality. Finite-time bounds for temporal difference learning for evaluating stationary policies with constant stepsize have been obtained in [47,9] under a variety of assumptions.…”
(mentioning)
confidence: 99%