A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

Li, Xiang; Yang, Wenhao; Liang, Jiadong; Zhang, Zhihua; Jordan, Michael I.

doi:10.48550/arxiv.2112.14582

Cited by 1 publication

(2 citation statements)

References 11 publications


“…There exist minimax-optimal model-based algorithms for learning a near-optimal value function (Azar et al, 2013) and policy (Agarwal et al, 2020; Li et al, 2020). Also, there exist minimax-optimal model-free algorithms for learning a near-optimal value function (Wainwright, 2019; Khamaru et al, 2021; Li et al, 2021b) and policy (Sidford et al, 2018). While model-based algorithms are conceptually simple, they have a higher computational complexity than that of model-free algorithms.…”

Section: Introduction (mentioning)

confidence: 99%

“…MDVI's underlying idea that enables such simplicity is, while implicit, the averaging of value function estimates. Li et al (2021b) shows that averaging Q-functions computed in Q-learning can find a near-optimal Q-function with a minimax-optimal sample complexity. Azar et al (2011) also provides a simple algorithm called Speedy Q-learning (SQL), which performs the averaging of value function estimates.…”

Section: Introduction (mentioning)

confidence: 99%
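The averaging idea described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' code: it assumes a synchronous generative model (one sampled next state per state-action pair per iteration), deterministic rewards, and a rescaled linear step size; the function name `averaged_q_learning` and the toy MDP are hypothetical.

```python
import random


def averaged_q_learning(P, R, gamma, T, seed=0):
    """Synchronous Q-learning with running (Polyak-Ruppert style) averaging.

    P[s][a] is a list of next-state probabilities and R[s][a] a deterministic
    reward. Returns (last iterate Q_T, average of the iterates Q_1..Q_T).
    """
    rng = random.Random(seed)
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    Q_bar = [[0.0] * n_a for _ in range(n_s)]
    for t in range(1, T + 1):
        # Step size is an illustrative assumption, not taken from the source.
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        V = [max(row) for row in Q]  # greedy state values of the current iterate
        new_Q = [[0.0] * n_a for _ in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                # Generative model: draw one next state for every (s, a).
                s_next = rng.choices(range(n_s), weights=P[s][a])[0]
                target = R[s][a] + gamma * V[s_next]
                new_Q[s][a] = (1.0 - eta) * Q[s][a] + eta * target
        Q = new_Q
        # Incremental update of the average over Q_1..Q_t.
        for s in range(n_s):
            for a in range(n_a):
                Q_bar[s][a] += (Q[s][a] - Q_bar[s][a]) / t
    return Q, Q_bar


# Hypothetical 2-state, 2-action MDP with rewards in [0, 1].
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 0.5]]
gamma = 0.5
Q_last, Q_bar = averaged_q_learning(P, R, gamma, T=5000)
```

The averaged estimate `Q_bar` is the quantity the statement refers to; the last iterate `Q_last` is returned only for comparison.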


KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal

Kozuno, Yang, Vieillard et al. 2022

Preprint

Self Cite


In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. Particularly, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for finding an ε-optimal policy when ε is sufficiently small. This is the first theoretical result that demonstrates that a simple model-free algorithm without variance-reduction can be nearly minimax-optimal under the considered setting.

