2021
DOI: 10.1137/20m1331524

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Abstract: We consider the problem of stochastic convex optimization under convex constraints. We analyze the behavior of a natural variance-reduced proximal gradient (VRPG) algorithm for this problem. Our main result is a non-asymptotic guarantee for the VRPG algorithm. In contrast to minimax worst-case guarantees, our result is instance-dependent in nature. This means that our guarantee captures the complexity of the loss function, the variability of the noise, and the geometry of the constraint set. We show that the non-asym…
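To make the kind of algorithm described in the abstract concrete, the following is a minimal sketch of one epoch of a generic variance-reduced proximal (projected) gradient method in the style of prox-SVRG; it is an illustration under stated assumptions, not the paper's exact VRPG algorithm. The per-sample gradient oracles grads, the projection operator project, the step size alpha, and the loop structure are all assumptions for illustration.

import numpy as np

def vrpg_epoch(w_ref, grads, project, alpha, n_inner, rng):
    """One epoch of a generic variance-reduced projected gradient method (sketch).

    w_ref   : snapshot point at which the full gradient is computed
    grads   : list of per-sample gradient functions g_i(w)
    project : Euclidean projection onto the convex constraint set
              (the proximal operator of its indicator function)
    """
    n = len(grads)
    # Full gradient at the snapshot, computed once per epoch.
    mu = sum(g(w_ref) for g in grads) / n

    w = w_ref.copy()
    for _ in range(n_inner):
        i = rng.integers(n)
        # Recentered stochastic gradient: unbiased for the full gradient at w,
        # with small variance when w stays close to the snapshot w_ref.
        v = grads[i](w) - grads[i](w_ref) + mu
        # Gradient step followed by projection back onto the constraint set.
        w = project(w - alpha * v)
    return w

For a Euclidean-ball constraint, project would simply rescale any iterate whose norm exceeds the radius; for a general convex set it is the Euclidean projection onto that set.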

Citations: cited by 15 publications (7 citation statements). References: 24 publications (41 reference statements).
“…In this non-asymptotic setting, much of this work is focused on either the tabular case, or the simpler setting of linear function approximation, as opposed to the non-parametric cases of interest here. We note that our results do depend on the problem instance, but this instance-dependence is not (yet) as sharp as that established in the simpler setting of tabular problems [31,19].…”
Section: Related Work (contrasting)
confidence: 69%
“…Our study leaves open a number of intriguing questions; let us mention a few of them here to conclude. First, although our bounds are instance-dependent, this dependence is not as refined as recent results in the simpler tabular and linear function settings [19,25]. In particular, our current results do not explicitly track the mixing properties of the transition kernel, which should enter in any such refined analysis.…”
Section: Discussion (mentioning)
confidence: 67%
“…• Policy evaluation for AMDPs (Critic): We first propose a simple and novel multiple trajectory method for policy evaluation in the generative model, which achieves O(t_mix log(1/ε)) sample complexity for the ℓ∞-bound on the bias of the estimators, as well as O(t_mix² / ε) sample complexity for the expected squared ℓ∞-error of the estimators. For the on-policy evaluation under Markovian noise, we develop an average-reward variant of the variance-reduced temporal difference (VRTD) algorithm (Khamaru et al., 2021; Li et al., 2021) with linear function approximation, which achieves O(t_mix³ log(1/ε)) sample complexity for the weighted ℓ2-error of the bias of the estimators, as well as an instance-dependent sample complexity for the expected weighted ℓ2-error of the estimators. The latter sample complexity improved the one in Zhang et al. (2021b) by a factor of O(t_mix²).…”
Section: Main Contributions (mentioning)
confidence: 99%
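For context on the variance-reduced temporal difference (VRTD) update referenced in the excerpt above, here is a minimal sketch of one epoch of a generic SVRG-style TD(0) update with linear function approximation in the discounted setting. The feature vectors, step size alpha, and transition batches are illustrative assumptions; this is a sketch of the general variance-reduction idea, not the exact algorithm of the cited papers.

import numpy as np

def td_direction(theta, phi_s, phi_next, reward, gamma):
    """Semi-gradient TD(0) direction at theta for one transition (s, r, s')."""
    td_error = reward + gamma * (phi_next @ theta) - (phi_s @ theta)
    return td_error * phi_s

def vrtd_epoch(theta_ref, transitions_ref, transitions, alpha, gamma):
    """One epoch of a variance-reduced TD(0) update with linear features (sketch).

    theta_ref       : reference (snapshot) parameter vector
    transitions_ref : batch of transitions used to estimate the mean TD
                      direction at the reference point
    transitions     : stream of transitions consumed by the inner loop
    Each transition is a tuple (phi_s, reward, phi_next).
    """
    # Mean TD direction at the reference point (the "full gradient" analogue).
    g_bar = np.mean(
        [td_direction(theta_ref, p, pn, r, gamma) for (p, r, pn) in transitions_ref],
        axis=0,
    )

    theta = theta_ref.copy()
    for (phi_s, r, phi_next) in transitions:
        # Recentered stochastic direction: current-point direction minus the
        # reference-point direction on the same sample, plus the batch mean.
        g_theta = td_direction(theta, phi_s, phi_next, r, gamma)
        g_ref = td_direction(theta_ref, phi_s, phi_next, r, gamma)
        theta = theta + alpha * (g_theta - g_ref + g_bar)
    return theta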
“…However, a consequence of accumulating gradients is that as training progresses the step size decreases, and eventually training stalls. To address this stagnation problem in AdaGrad, G. Hinton proposed RMSProp (Root Mean Square Propagation), which computes the cumulative gradient as a moving average over a single window, so that the step size adapts to the current gradient and yields better optimization [7,8]. As for the variance problem introduced by the stochasticity of SGD, SVRG corrects the gradient used for each model update using the global average gradient information [9].…”
Section: Introduction (mentioning)
confidence: 99%
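As a concrete illustration of the contrast drawn in the excerpt above, here is a minimal sketch of the RMSProp update, which replaces AdaGrad's ever-growing sum of squared gradients with an exponential moving average. The hyperparameter values are common illustrative defaults, not values taken from the cited works.

import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp update on parameters theta given gradient grad and state v."""
    # Exponential moving average of squared gradients: only a recent "window"
    # of gradients effectively contributes, unlike AdaGrad's unbounded sum,
    # so the effective step size does not shrink toward zero as training goes on.
    v = rho * v + (1.0 - rho) * grad ** 2
    # Per-coordinate step size adapts to the current gradient magnitude.
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v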