2021
DOI: 10.1007/s10994-020-05912-5
Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling

Abstract: We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an O(d) improvement in complexity in comparison to LSTD, where d is the dimension of the data. We provide non-asymptotic bounds for our proposed m…
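The scheme described in the abstract — regular TD(0) with linear function approximation, but with each transition drawn uniformly at random from a fixed batch — can be sketched as below. This is a minimal illustration, not the paper's exact algorithm: the function name, step-size choice, and iteration count are assumptions, and the paper's step-size schedule and iterate averaging are omitted.

```python
import numpy as np

def td0_uniform_batch(dataset, phi, d, gamma=0.9, alpha=0.01, n_iters=10_000, seed=0):
    """TD(0) with linear function approximation, sampling transitions
    uniformly from a fixed batch (an illustrative sketch).

    dataset: list of transitions (s, r, s_next)
    phi: feature map, state -> np.ndarray of shape (d,)
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    n = len(dataset)
    for _ in range(n_iters):
        s, r, s_next = dataset[rng.integers(n)]          # uniform sample from batch
        f, f_next = phi(s), phi(s_next)
        delta = r + gamma * (f_next @ theta) - f @ theta  # TD error
        theta += alpha * delta * f                        # O(d) work per update
    return theta
```

Each update touches only one d-dimensional feature vector, which is the source of the O(d) per-iteration advantage over LSTD's O(d^2) matrix updates.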

Cited by 9 publications (22 citation statements)
References 22 publications
“…Therefore, the random variable Z is a better unbiased estimator of μ than X. The formula shows that the variance of Z is small as long as X is sufficiently correlated with Y [3]; for this reason Y is called the control variable of X. This is the control variable method.…”
Section: Controlled Variable Method
confidence: 99%
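The control variable (control variate) construction quoted above — Z = X − c(Y − E[Y]), which is unbiased for μ = E[X] and has lower variance when X and Y are correlated — can be illustrated with a standard toy estimate of E[e^U] for U uniform on [0, 1]. The choice of X, Y, and the coefficient formula c* = Cov(X, Y)/Var(Y) are textbook conventions, not taken from the cited paper; estimating c from the same samples introduces a small bias that is commonly ignored in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.uniform(0.0, 1.0, n)
x = np.exp(u)        # X: quantity of interest, E[X] = e - 1
y = u                # Y: control variable with known mean E[Y] = 0.5

# Near-optimal coefficient c* = Cov(X, Y) / Var(Y), estimated from the samples.
c = np.cov(x, y)[0, 1] / np.var(y)

# Z = X - c (Y - E[Y]): same mean as X, smaller variance when X, Y correlate.
z = x - c * (y - 0.5)
```

Here Var(Z) drops well below Var(X) because e^U and U are strongly correlated, while the sample mean of Z still estimates e − 1.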
See 1 more Smart Citation
“…Therefore, it is better to use the random variable Z as an unbiased estimate of μ than X. From the formula, it can be seen that the variance of Z is sufficiently small as long as the random variable X is guaranteed to show a certain correlation with Y [3], so Y is also called the control variable of X. This is the control variable method.…”
Section: Controlled Variable Methodsmentioning
confidence: 99%
“…For the SGD variance problem, there are currently three mainstream methods for reducing sampling variance: importance sampling, stratified sampling, and the control variable method. The objective function in machine learning is usually minimized using Batch Gradient Descent (BGD) or SGD [3]. The BGD algorithm computes the gradients of all samples in each iteration to perform the weight update, while SGD randomly selects one training sample at a time and updates the parameters using that sample's gradient.…”
Section: Introduction
confidence: 99%
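The BGD/SGD contrast in the excerpt above can be made concrete on a least-squares objective: BGD averages the gradient over the whole dataset each step, while SGD uses one randomly chosen sample's gradient, a higher-variance but much cheaper estimate. This is a generic sketch under assumed data and step sizes, not code from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

# BGD: gradient over all samples per iteration (cost O(n) per step).
w_bgd = np.zeros(3)
for _ in range(500):
    grad = X.T @ (X @ w_bgd - y) / len(y)   # full-batch gradient
    w_bgd -= 0.1 * grad

# SGD: one uniformly sampled gradient per iteration (cost O(1) per step).
w_sgd = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(y))
    grad_i = X[i] * (X[i] @ w_sgd - y[i])   # single-sample gradient
    w_sgd -= 0.01 * grad_i
```

Both reach a neighbourhood of the least-squares solution; the SGD iterates fluctuate around it because of the sampling variance that the three methods above are designed to reduce.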
“…) in (33) by using arguments similar to those used in arriving at Eq. (79) in [37]. In particular, the latter bound uses Jensen's inequality and the convexity of f(x) = x^(−2α) exp(x^(1−α)).…”
Section: A3 Proof of Theorem
confidence: 99%
“…Analysis of TD algorithms is challenging, and researchers have devoted significant effort to studying their asymptotic properties [7,11,15,19]. In recent years, there has been interest in characterising the finite-time behaviour of TD, and several papers [1,2,3,9,13] have tackled this problem under various assumptions. For T iterations/updates, most existing works provide either an O(1/√T) (with universal step-size) [1,3] or an O(1/T) (with constant step-size) [1,9,13] convergence rate to the TD fixed point θ⋆ (see Section 2 for the notational information).…”
Section: Introduction
confidence: 99%