2021
DOI: 10.48550/arxiv.2106.13125
Preprint
Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation

Abstract: Model-agnostic meta-reinforcement learning requires estimating the Hessian matrix of value functions. This is challenging from an implementation perspective, as repeatedly differentiating policy gradient estimates may lead to biased Hessian estimates. In this work, we provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation. Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessi…

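The bias mentioned in the abstract can be illustrated with a minimal sketch (not the paper's method): for J(θ) = E_{x∼p_θ}[f(x)], differentiating the sampled score-function surrogate f(x)·log p_θ(x) twice with x held fixed drops the score outer-product term ∇log p_θ ∇log p_θᵀ, biasing the Hessian estimate. The Bernoulli toy model below is a hypothetical example chosen so the analytic Hessian is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5
N = 200_000

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
p = sigmoid(theta)                       # Bernoulli(logit = theta) success probability
x = rng.binomial(1, p, size=N).astype(float)
f = x                                    # objective J(theta) = E[x] = sigmoid(theta)

# For a Bernoulli with logit theta: log p(x) = x*theta - log(1 + e^theta)
dlogp = x - p                            # d/dtheta log p(x)
d2logp = -p * (1 - p)                    # d^2/dtheta^2 log p(x), independent of x

# Naive estimator: autodiff of the surrogate f(x)*log p(x) twice, x detached.
naive_hess = np.mean(f * d2logp)

# Unbiased estimator: retains the score outer-product term.
correct_hess = np.mean(f * (d2logp + dlogp ** 2))

analytic_hess = p * (1 - p) * (1 - 2 * p)  # exact d^2 J / d theta^2

print(naive_hess, correct_hess, analytic_hess)
```

At θ = 0.5 the analytic Hessian is roughly −0.058; the corrected Monte Carlo estimate concentrates there, while the naive estimate concentrates near −σ(θ)σ′(θ) ≈ −0.146, showing the systematic bias.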
Cited by 1 publication (6 citation statements)
References 21 publications
“…This is mainly because practical algorithms can only estimate ∇_g V_g(θ_N) instead of ∇_g V_g(θ), while the latter is required to estimate J_∞(θ, g) in an unbiased way. This observation was also hinted at recently in (Tang et al., 2021).…”
Section: Discussion On Prior Work (supporting)
confidence: 81%
“…Prior work in fact constructs the LSF gradient estimate. Since most prior work derives meta-RL gradient estimates based on J_∞(θ, g) (Foerster et al., 2018; Rothfuss et al., 2018; Liu et al., 2019; Tang et al., 2021), and due to the accidental replacement of θ by θ_N, we conclude that they in fact construct variants of the LSF gradient estimate (see comments following Corollary 4.3). In particular, they construct Ĵ such that E[Ĵ] = E[Ĵ_{N,LSF}(θ, g)] but with potentially lower variance.…”
Section: Discussion On Prior Work (mentioning)
confidence: 80%