2020
DOI: 10.48550/arxiv.2001.04515
Preprint

Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings

Cited by 16 publications (43 citation statements)
References 40 publications
“…With a Cramér-Rao lower bound established in Jiang and Li [2016], the asymptotic efficiency of estimators using linear approximation has been discussed [Hao et al., 2021, Yin and Wang, 2020, Mou et al., 2020a], as well as a semiparametric doubly robust estimator [Kallus and Uehara, 2020]. To estimate the optimal Q*, multi-stage algorithms have been proposed and their asymptotic behaviors analyzed [Luckett et al., 2019, Shi et al., 2020]. We supplement these upper-bound works with a semiparametric efficiency lower bound and show that averaged Q-learning is the most efficient RAL estimator to achieve it.…”
Section: Related Work (mentioning)
confidence: 99%
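The averaged Q-learning estimator highlighted in this statement is, in its simplest tabular form, ordinary Q-learning with Polyak-Ruppert averaging of the iterates. A minimal sketch under that reading follows; the environment interface (`reset`/`step`), the uniform behaviour policy, and the step-size schedule are illustrative assumptions, not details taken from the cited papers.

```python
import numpy as np

def averaged_q_learning(env, n_states, n_actions, gamma=0.9,
                        n_steps=10000, seed=0):
    """Tabular Q-learning with Polyak-Ruppert averaging of the iterates.

    Returns the last iterate Q and the averaged iterate Q_bar; the averaged
    iterate is the quantity whose asymptotic efficiency the passage above
    discusses.  The env interface (reset/step) is an assumed convention.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))      # running Q-learning iterate
    Q_bar = np.zeros((n_states, n_actions))  # Polyak-Ruppert average

    s = env.reset()
    for t in range(1, n_steps + 1):
        # behaviour policy: uniform exploration (illustrative choice)
        a = rng.integers(n_actions)
        s_next, r, done = env.step(a)

        # standard Q-learning update with a slowly decaying step size
        alpha = 1.0 / t ** 0.7
        td_target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (td_target - Q[s, a])

        # online averaging: Q_bar_t is the mean of the first t iterates
        Q_bar += (Q - Q_bar) / t

        s = env.reset() if done else s_next

    return Q, Q_bar
```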
“…However, both papers assume a generative model, where observation tuples are sampled independently from the stationary distribution of the underlying Markov decision process. Shi et al. (2021) proposed an inference method for the state-action value (Q) function, using sieve methods to approximate it. This is an offline method that directly computes the value estimates using batch updates.…”
Section: Related Work and Our Contributions (mentioning)
confidence: 99%
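A sieve estimator of the kind described above approximates the Q-function with a finite basis expansion and solves the projected Bellman equation in a single batch least-squares pass over the logged data. The sketch below illustrates that idea for evaluating a fixed target policy; the function names, the ridge term, and the feature-map interface are assumptions for illustration, not the implementation of Shi et al. (2021).

```python
import numpy as np

def sieve_q_estimate(transitions, feature_map, target_policy, gamma=0.9,
                     ridge=1e-6):
    """Batch (offline) sieve estimate of the Q-function of `target_policy`.

    Approximates Q(s, a) ~ phi(s, a)' theta with a fixed basis `feature_map`
    and solves the projected Bellman equation by least squares over all
    logged transitions (s, a, r, s_next) in one pass.
    """
    d = feature_map(*transitions[0][:2]).shape[0]
    A = ridge * np.eye(d)   # small ridge term keeps the linear system well posed
    b = np.zeros(d)

    for s, a, r, s_next in transitions:
        phi = feature_map(s, a)
        phi_next = feature_map(s_next, target_policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi

    theta = np.linalg.solve(A, b)
    return lambda s, a: feature_map(s, a) @ theta  # estimated Q-function
```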
“…Our goal is to develop an approach to estimate the value function under each candidate model during our policy optimization procedure with theoretical guarantees. The proposed algorithm is motivated by recent developments in statistical inference for sequential decision making (Luedtke & Van Der Laan, 2016; Shi et al., 2020). The idea is to first estimate the optimal Q-function Q*, the optimal policy π*, and the resulting ratio function based on a chunk of data, and then evaluate the performance of the estimated policy on the next chunk of data using the previously estimated nuisance functions.…”
Section: Sequential Model Selection (mentioning)
confidence: 99%
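The chunk-wise procedure described in this statement can be sketched as follows: fit the nuisance functions (the optimal Q-function, the optimal policy, and the density ratio) on one chunk, then form a value estimate of the fitted policy on the next chunk from a plug-in term plus a ratio-weighted temporal-difference correction. The `fit_nuisances` routine, the initial-state sample, and the doubly-robust-style combination below are hypothetical stand-ins, not the estimators used in the cited papers.

```python
import numpy as np

def chunked_policy_value_estimates(chunks, init_states, fit_nuisances,
                                   gamma=0.9):
    """Sequential (chunk-wise) value estimation in the spirit of the passage above.

    chunks        : list of transition batches, each a list of (s, a, r, s_next)
    init_states   : sample of initial states used for the plug-in term
    fit_nuisances : hypothetical routine returning (q_hat, pi_hat, w_hat) --
                    an estimated optimal Q-function, its greedy policy, and a
                    state-action density-ratio estimate -- fitted on one chunk
    """
    values = []
    for k in range(len(chunks) - 1):
        # 1) fit all nuisance functions on chunk k only
        q_hat, pi_hat, w_hat = fit_nuisances(chunks[k])

        # 2) evaluate the fitted policy on chunk k + 1: plug-in term plus a
        #    ratio-weighted temporal-difference correction (doubly-robust style)
        plug_in = np.mean([q_hat(s, pi_hat(s)) for s in init_states])
        td_correction = np.mean([
            w_hat(s, a) * (r + gamma * q_hat(s_next, pi_hat(s_next)) - q_hat(s, a))
            for s, a, r, s_next in chunks[k + 1]
        ]) / (1.0 - gamma)
        values.append(plug_in + td_correction)

    return values  # one value estimate per held-out chunk
```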