2021
DOI: 10.48550/arxiv.2112.14582
Preprint

A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

Abstract: We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a γ-discounted MDP. We establish asymptotic normality for the averaged iteration Q_T. Furthermore, we show that Q_T is actually a regular asymptotically linear (RAL) estimator for the optimal Q-value function Q* with the most efficient influence function. This implies that the averaged Q-learning iteration has the smallest asymptotic variance among all RAL estimators. In addition, we present a non-asymptotic analysis for the…
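To make the object of study concrete, the following is a minimal sketch of synchronous Q-learning with Polyak-Ruppert averaging on a small tabular MDP. It is an illustration under stated assumptions, not the paper's implementation: the function names `averaged_q_learning` and `optimal_q`, the polynomial step size t^(-0.75), the deterministic reward table `R`, and the randomly generated MDP are all choices made here for the example. "Synchronous" is taken to mean that one next state is sampled for every state-action pair at each iteration.

```python
import numpy as np


def averaged_q_learning(P, R, gamma, num_iters, seed=0):
    """Synchronous Q-learning with a running (Polyak-Ruppert) average.

    P: array of shape (S, A, S), transition probabilities.
    R: array of shape (S, A), rewards (assumed deterministic here).
    Returns the average of the iterates Q_1, ..., Q_T.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    Q_bar = np.zeros((S, A))
    cdf = P.cumsum(axis=2)
    for t in range(1, num_iters + 1):
        # Sample one next state for every (s, a) pair by inverse-CDF sampling.
        u = rng.random((S, A, 1))
        next_states = np.minimum((u > cdf).sum(axis=2), S - 1)
        # Empirical Bellman target built from the sampled next states.
        target = R + gamma * Q[next_states].max(axis=2)
        eta = t ** -0.75  # assumed polynomial step size
        Q = (1.0 - eta) * Q + eta * target
        # Running average of the iterates produced so far.
        Q_bar += (Q - Q_bar) / t
    return Q_bar


def optimal_q(P, R, gamma, tol=1e-10):
    """Reference Q* computed by exact Q-value iteration (requires known P)."""
    Q = np.zeros_like(R)
    while True:
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    P = rng.dirichlet(np.ones(4), size=(4, 2))  # random 4-state, 2-action MDP
    R = rng.random((4, 2))
    Q_bar = averaged_q_learning(P, R, gamma=0.5, num_iters=5000)
    print(np.abs(Q_bar - optimal_q(P, R, 0.5)).max())
```

The averaged table `Q_bar`, rather than the last iterate `Q`, is the quantity whose asymptotic distribution the abstract describes.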

Cited by 1 publication
(2 citation statements)
References 11 publications
“…There exist minimax-optimal model-based algorithms for learning a near-optimal value function (Azar et al, 2013) and policy (Agarwal et al, 2020; Li et al, 2020). Also, there exist minimax-optimal model-free algorithms for learning a near-optimal value function (Wainwright, 2019; Khamaru et al, 2021; Li et al, 2021b) and policy (Sidford et al, 2018). While model-based algorithms are conceptually simple, they have a higher computational complexity than that of model-free algorithms.…”
Section: Introduction (mentioning)
confidence: 99%
“…MDVI's underlying idea that enables such simplicity is, while implicit, the averaging of value function estimates. Li et al (2021b) shows that averaging Q-functions computed in Q-learning can find a near-optimal Q-function with a minimax-optimal sample complexity. Azar et al (2011) also provides a simple algorithm called Speedy Q-learning (SQL), which performs the averaging of value function estimates.…”
Section: Introduction (mentioning)
confidence: 99%