2021
DOI: 10.48550/arxiv.2112.13109
Preprint
Accelerated and instance-optimal policy evaluation with linear function approximation

Abstract: We study the problem of policy evaluation with linear function approximation and present efficient and practical algorithms that come with strong optimality guarantees. We begin by proving lower bounds that establish baselines on both the deterministic error and stochastic error in this problem. In particular, we prove an oracle complexity lower bound on the deterministic error in an instance-dependent norm associated with the stationary distribution of the transition kernel, and use the local asymptotic minimax…
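For orientation, the setting described in the abstract is policy evaluation with linear function approximation. The following is a minimal sketch of plain TD(0) with linear features on a synthetic Markov reward process; the transition kernel, rewards, features, and step sizes are all illustrative assumptions, and this is not the accelerated, instance-optimal algorithm proposed in the paper.

```python
import numpy as np

# Minimal sketch of TD(0) policy evaluation with linear function
# approximation on a small synthetic Markov reward process. All
# quantities (P, r, Phi, gamma, step sizes) are illustrative
# assumptions; this is plain TD(0), not the paper's algorithm.

rng = np.random.default_rng(0)
n_states, d, gamma = 8, 3, 0.9

P = rng.dirichlet(np.ones(n_states), size=n_states)  # transition kernel under the target policy
r = rng.uniform(0.0, 1.0, size=n_states)             # expected rewards
Phi = rng.standard_normal((n_states, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)    # unit-norm features, row s = phi(s)

theta = np.zeros(d)
s = 0
for t in range(1, 100_001):
    s_next = rng.choice(n_states, p=P[s])             # next state along the trajectory
    delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta  # TD error
    theta += (1.0 / np.sqrt(t)) * delta * Phi[s]      # semi-gradient TD(0) update
    s = s_next

print("estimated values Phi @ theta:", Phi @ theta)
```

The paper's lower bounds are stated in an instance-dependent norm weighted by the stationary distribution of this transition kernel; the sketch above only illustrates the estimation problem itself.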

Cited by 2 publications (7 citation statements) · References 28 publications
“…It is an instance-dependent characterization of the stochastic error for solving the projected Bellman equation (3.13) with sample access to the transition kernel. A direct extension of Proposition 1 of Li et al (2021) demonstrates that this term matches the asymptotic instance-dependent lower bound on stochastic error of solving Eq. (3.13) with i.i.d.…”
mentioning
confidence: 54%
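The citing work's Eq. (3.13) is not reproduced on this page. For orientation, the standard projected Bellman equation for policy evaluation with linear function approximation is sketched below in generic notation (feature matrix Φ, stationary-distribution weighting D, discount γ); this is an assumed generic form, not the cited paper's exact display.

```latex
% Generic projected Bellman equation (not the citing paper's Eq. (3.13)):
\[
  \Phi\theta^{*} \;=\; \Pi_{D}\bigl(r^{\pi} + \gamma P^{\pi}\Phi\theta^{*}\bigr),
\]
% equivalently, the linear system $\bar{A}\theta^{*} = \bar{b}$ with
\[
  \bar{A} = \Phi^{\top} D\,(I - \gamma P^{\pi})\,\Phi,
  \qquad
  \bar{b} = \Phi^{\top} D\, r^{\pi},
\]
% where $D = \operatorname{diag}(\mu)$ for the stationary distribution $\mu$ of
% $P^{\pi}$ and $\Pi_{D}$ is the $D$-weighted projection onto the span of $\Phi$.
```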
“…• Policy evaluation for AMDPs (Critic): We first propose a simple and novel multiple trajectory method for policy evaluation in the generative model, which achieves O(t_mix log(1/ε)) sample complexity for an ℓ∞-bound on the bias of the estimators, as well as O(t_mix^2/ε) sample complexity for the expected squared ℓ∞-error of the estimators. For on-policy evaluation under Markovian noise, we develop an average-reward variant of the variance-reduced temporal difference (VRTD) algorithm (Khamaru et al, 2021; Li et al, 2021) with linear function approximation, which achieves O(t_mix^3 log(1/ε)) sample complexity for the weighted ℓ2-error of the bias of the estimators, as well as an instance-dependent sample complexity for the expected weighted ℓ2-error of the estimators. The latter sample complexity improved the one in Zhang et al (2021b) by a factor of O(t_mix^2).…”
Section: Main Contributions
mentioning
confidence: 99%
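The VRTD-style scheme mentioned above recenters stochastic TD updates around a batch estimate computed at a reference iterate. The sketch below shows that control-variate pattern for discounted TD(0) with linear features under an i.i.d. sampling model; it is a schematic illustration under synthetic assumptions, not the average-reward variant or the exact algorithms of Khamaru et al (2021) or Li et al (2021).

```python
import numpy as np

# Schematic variance-reduced TD(0) with linear function approximation:
# a batch "recentering" direction at a reference iterate plus per-sample
# control-variate corrections. Illustrative sketch only; not the exact
# algorithm analyzed in the cited works.

rng = np.random.default_rng(1)
n_states, d, gamma = 8, 3, 0.9

P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.uniform(size=n_states)
Phi = rng.standard_normal((n_states, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)

# Stationary distribution of P, used as the i.i.d. sampling distribution.
mu = np.ones(n_states) / n_states
for _ in range(1000):
    mu = mu @ P
mu /= mu.sum()

def td_direction(theta, s, s_next):
    """TD(0) update direction for a single transition s -> s_next."""
    delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    return delta * Phi[s]

theta = np.zeros(d)
alpha, n_epochs, batch_size, inner_steps = 0.1, 20, 2000, 2000
for _ in range(n_epochs):
    theta_ref = theta.copy()
    # Recentering: average TD direction at the reference iterate.
    states = rng.choice(n_states, size=batch_size, p=mu)
    batch = [(s, rng.choice(n_states, p=P[s])) for s in states]
    g_ref = np.mean([td_direction(theta_ref, s, sn) for s, sn in batch], axis=0)
    for _ in range(inner_steps):
        s = rng.choice(n_states, p=mu)
        s_next = rng.choice(n_states, p=P[s])
        # Control variate: fresh direction, minus its value at the
        # reference iterate, plus the batch recentering term.
        g = td_direction(theta, s, s_next) - td_direction(theta_ref, s, s_next) + g_ref
        theta += alpha * g

print("estimated values Phi @ theta:", Phi @ theta)
```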
“…For stochastic gradient (SG) methods in the Euclidean setting, such bounds have been established for Polyak-Ruppert-averaged SG [MB11, GP17] and variance-reduced SG algorithms [FGKS15, LMWJ20], with the sample complexity and high-order terms being improved over time. For reinforcement learning problems, such guarantees have been established in the ∥·∥∞-norm for temporal difference methods [KPR+20] and Q-learning [KXWJ21] under a generative model, as well as for Markovian trajectories [MPWB21, LLP21] under the ℓ2-norm. In the context of stochastic optimization, the paper [LMWJ20] provides a fine-grained bound for ROOT-SGD with a unity pre-factor on the leading-order instance-dependent term.…”
Section: Stochastic Approximation and Asymptotic Guarantees
mentioning
confidence: 99%
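The Polyak-Ruppert averaging referenced above simply tracks the running mean of the SGD iterates, which is the quantity the cited instance-dependent bounds apply to. A minimal sketch on a synthetic quadratic objective follows; the objective, noise level, and step-size schedule are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Polyak-Ruppert iterate averaging for SGD on a
# synthetic quadratic objective. The problem instance, noise, and step
# sizes are illustrative assumptions; only the averaging scheme matters.

rng = np.random.default_rng(2)
d, n_steps = 5, 100_000

A = rng.standard_normal((d, d))
H = A @ A.T / d + np.eye(d)            # Hessian of the quadratic objective
theta_star = rng.standard_normal(d)    # minimizer of the population objective

theta = np.zeros(d)
theta_avg = np.zeros(d)
for t in range(1, n_steps + 1):
    # Stochastic gradient: population gradient plus additive noise.
    grad = H @ (theta - theta_star) + 0.1 * rng.standard_normal(d)
    theta -= (0.5 / t**0.6) * grad        # slowly decaying step size
    theta_avg += (theta - theta_avg) / t  # running Polyak-Ruppert average

print("last-iterate error:", np.linalg.norm(theta - theta_star))
print("averaged error    :", np.linalg.norm(theta_avg - theta_star))
```

On an instance like this, the averaged iterate typically ends up with noticeably smaller error than the last iterate; the cited analyses quantify that gap in instance-dependent terms.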