2020
DOI: 10.48550/arxiv.2001.04515
Preprint

Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings

Cited by 16 publications (43 citation statements)
References 40 publications
“…With a Cramér-Rao lower bound established in Jiang and Li [2016], the asymptotic efficiency of estimators using linear approximation has been discussed [Hao et al., 2021, Yin and Wang, 2020, Mou et al., 2020a], as well as a semiparametric doubly robust estimator [Kallus and Uehara, 2020]. To estimate the optimal Q*, multi-stage algorithms have been proposed and their asymptotic behaviors analyzed [Luckett et al., 2019, Shi et al., 2020]. We supplement these upper-bound works with a semiparametric efficiency lower bound and show that averaged Q-learning is the most efficient RAL estimator to achieve it.…”
Section: Related Work (mentioning)
confidence: 99%
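The averaged Q-learning estimator highlighted in this statement is, in its simplest tabular form, ordinary Q-learning with Polyak-Ruppert averaging of the iterates. A minimal sketch under that reading follows; the environment interface (`reset`/`step`), the uniform behaviour policy, and the step-size schedule are illustrative assumptions, not details taken from the cited papers.

```python
import numpy as np

def averaged_q_learning(env, n_states, n_actions, gamma=0.9,
                        n_steps=10000, seed=0):
    """Tabular Q-learning with Polyak-Ruppert averaging of the iterates.

    Returns the last iterate Q and the averaged iterate Q_bar; the averaged
    iterate is the quantity whose asymptotic efficiency the passage above
    discusses.  The env interface (reset/step) is an assumed convention.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))      # running Q-learning iterate
    Q_bar = np.zeros((n_states, n_actions))  # Polyak-Ruppert average

    s = env.reset()
    for t in range(1, n_steps + 1):
        # behaviour policy: uniform exploration (illustrative choice)
        a = rng.integers(n_actions)
        s_next, r, done = env.step(a)

        # standard Q-learning update with a slowly decaying step size
        alpha = 1.0 / t ** 0.7
        td_target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (td_target - Q[s, a])

        # online averaging: Q_bar_t is the mean of the first t iterates
        Q_bar += (Q - Q_bar) / t

        s = env.reset() if done else s_next

    return Q, Q_bar
```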
“…However, both papers assume a generative model, where observation tuples are sampled independently from the stationary distribution of the underlying Markov decision process. Shi et al. (2021) proposed an inference method for the state-action value (Q) function, using sieve methods to approximate it. This is an offline method that directly computes the value estimates using batch updates.…”
Section: Related Work and Our Contributions (mentioning)
confidence: 99%
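A sieve estimator of the kind described above approximates the Q-function with a finite basis expansion and solves the projected Bellman equation in a single batch least-squares pass over the logged data. The sketch below illustrates that idea for evaluating a fixed target policy; the function names, the ridge term, and the feature-map interface are assumptions for illustration, not the implementation of Shi et al. (2021).

```python
import numpy as np

def sieve_q_estimate(transitions, feature_map, target_policy, gamma=0.9,
                     ridge=1e-6):
    """Batch (offline) sieve estimate of the Q-function of `target_policy`.

    Approximates Q(s, a) ~ phi(s, a)' theta with a fixed basis `feature_map`
    and solves the projected Bellman equation by least squares over all
    logged transitions (s, a, r, s_next) in one pass.
    """
    d = feature_map(*transitions[0][:2]).shape[0]
    A = ridge * np.eye(d)   # small ridge term keeps the linear system well posed
    b = np.zeros(d)

    for s, a, r, s_next in transitions:
        phi = feature_map(s, a)
        phi_next = feature_map(s_next, target_policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi

    theta = np.linalg.solve(A, b)
    return lambda s, a: feature_map(s, a) @ theta  # estimated Q-function
```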
“…Our goal is to develop an approach to estimate the value function under each candidate model during our policy optimization procedure with theoretical guarantees. The proposed algorithm is motivated by recent developments in statistical inference for sequential decision making (Luedtke & Van Der Laan, 2016; Shi et al., 2020). The idea is to first estimate the optimal Q-function Q*, the optimal policy π*, and the resulting ratio function based on a chunk of data, and then evaluate the performance of the estimated policy on the next chunk of data using the previously estimated nuisance functions.…”
Section: Sequential Model Selection (mentioning)
confidence: 99%
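The chunk-wise procedure described in this statement can be sketched as follows: fit the nuisance functions (the optimal Q-function, the optimal policy, and the density ratio) on one chunk, then form a value estimate of the fitted policy on the next chunk from a plug-in term plus a ratio-weighted temporal-difference correction. The `fit_nuisances` routine, the initial-state sample, and the doubly-robust-style combination below are hypothetical stand-ins, not the estimators used in the cited papers.

```python
import numpy as np

def chunked_policy_value_estimates(chunks, init_states, fit_nuisances,
                                   gamma=0.9):
    """Sequential (chunk-wise) value estimation in the spirit of the passage above.

    chunks        : list of transition batches, each a list of (s, a, r, s_next)
    init_states   : sample of initial states used for the plug-in term
    fit_nuisances : hypothetical routine returning (q_hat, pi_hat, w_hat) --
                    an estimated optimal Q-function, its greedy policy, and a
                    state-action density-ratio estimate -- fitted on one chunk
    """
    values = []
    for k in range(len(chunks) - 1):
        # 1) fit all nuisance functions on chunk k only
        q_hat, pi_hat, w_hat = fit_nuisances(chunks[k])

        # 2) evaluate the fitted policy on chunk k + 1: plug-in term plus a
        #    ratio-weighted temporal-difference correction (doubly-robust style)
        plug_in = np.mean([q_hat(s, pi_hat(s)) for s in init_states])
        td_correction = np.mean([
            w_hat(s, a) * (r + gamma * q_hat(s_next, pi_hat(s_next)) - q_hat(s, a))
            for s, a, r, s_next in chunks[k + 1]
        ]) / (1.0 - gamma)
        values.append(plug_in + td_correction)

    return values  # one value estimate per held-out chunk
```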