2021
DOI: 10.48550/arxiv.2102.01567
Preprint

A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

Abstract: This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous Reinforcement Learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian Stochastic Approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this central result, we establish finite-sample mean-square convergence bounds for as…
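
To make the reformulation described in the abstract concrete, here is a minimal, hypothetical Python sketch of tabular asynchronous Q-learning written as a Markovian stochastic approximation scheme for the Bellman optimality fixed-point equation Q = T(Q). The array shapes, step size, and the behavior_policy argument are illustrative assumptions rather than the paper's setup; the point is that only the single (state, action) entry visited along a Markovian trajectory is updated at each step, which is the asynchronous aspect the paper analyzes.

```python
import numpy as np

def async_q_learning(P, R, gamma, behavior_policy, num_steps, alpha=0.1, seed=0):
    """Tabular asynchronous Q-learning viewed as Markovian stochastic approximation.

    P[s, a, s'] are transition probabilities and R[s, a] expected rewards
    (illustrative shapes, not the paper's notation). Only the (s, a) entry
    visited by the behavior trajectory is updated at each step.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(num_steps):
        a = behavior_policy(s, rng)                # behavior policy drives a Markov chain on (s, a)
        s_next = rng.choice(n_states, p=P[s, a])   # Markovian sample, not i.i.d.
        # Noisy evaluation of the Bellman optimality operator at the visited entry:
        target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])      # SA step toward the fixed point Q* = T(Q*)
        s = s_next
    return Q
```

Under standard step-size and exploration conditions this iteration converges to the fixed point Q*; the paper's Lyapunov analysis quantifies the mean-square error of such Markovian, asynchronous updates after finitely many steps.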

Cited by 21 publications (105 citation statements)
References 41 publications

“…Central limit theorems [For15] and non-asymptotic convergence rates [KMMW19] have been established for controlled Markov processes. In addition to the papers discussed in Section 1, several recent works have considered particular aspects of SA with Markov data, including two-timescale variants [DNPR20, KB18], observation skipping schemes for bias reduction [KLL20], Lyapunov function-based analysis under general norms [CMSS21], and proving guarantees under weaker ergodicity conditions [DDA21].…”
Section: Stochastic Approximation Methods (mentioning; confidence: 99%)
“…There is a long line of past work on this algorithm, including convergence guarantees [Tsi94, Sze98, EDM03], results on linear function approximation for optimal stopping problems [TVR99, BRS18], and non-asymptotic rates under general norms in both the i.i.d. setting [Wai19a, Bor21] as well as the Markovian setting [CMSS21]. A class of variants of TD and Q-learning is also studied in the literature, including actor-critic methods [KT00], SARSA [RN94], and methods that employ variance-reduction [SWW + 18, KPR + 21, Wai19b, KXWJ21].…”
Section: Application to RL Problems (mentioning; confidence: 99%)
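
For context on the TD-learning variants cited in the snippet above, the following is a minimal sketch, assuming a toy Markovian data stream and made-up array names, of TD(0) with linear function approximation; the i.i.d. and Markovian-setting rates mentioned there concern updates of exactly this form driven by a single Markov-chain trajectory.

```python
import numpy as np

def td0_linear(features, P, R, gamma, num_steps, alpha=0.05, seed=0):
    """TD(0) with linear function approximation, V(s) ~ features[s] @ theta.

    features[s] is a feature vector per state, P[s, s'] the transition matrix
    of the policy being evaluated, and R[s] its expected reward (illustrative
    names). The state sequence is one Markov-chain trajectory, so the update
    noise is Markovian rather than i.i.d.
    """
    rng = np.random.default_rng(seed)
    n_states, d = features.shape
    theta = np.zeros(d)
    s = 0
    for _ in range(num_steps):
        s_next = rng.choice(n_states, p=P[s])
        # Temporal-difference error along the observed transition s -> s_next:
        td_error = R[s] + gamma * features[s_next] @ theta - features[s] @ theta
        theta += alpha * td_error * features[s]    # semi-gradient SA update
        s = s_next
    return theta
```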
“…In recent work, a subset of the current authors (Kotsalis et al, 2020b) provided an improved analysis of the vanilla TD algorithm that can benefit from parallel computing. Other notable analyses of TD learning and some of its variants include those by Srikant and Ying (2019); Chen et al (2021); Durmus et al (2021). While some of these analyses are sharp, it is well-known that vanilla TD learning does not attain optimal oracle and sample complexities.…”
Section: Related Work (mentioning; confidence: 99%)
“…One important ingredient of our convergence results is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains. Similar to Chen et al (2021b), we rely on the use of the generalized Moreau envelope to form a Lyapunov function. Our results, however, extend those of Chen et al (2021b) from time-homogeneous to time-inhomogeneous Markov chains and from time-homogeneous to time-inhomogeneous update operators.…”
Section: Introduction (mentioning; confidence: 99%)
“…Similar to Chen et al (2021b), we rely on the use of the generalized Moreau envelope to form a Lyapunov function. Our results, however, extend those of Chen et al (2021b) from time-homogeneous to time-inhomogeneous Markov chains and from time-homogeneous to time-inhomogeneous update operators. Those extensions make our results immediately applicable to the off-policy actor-critic settings and are made possible by establishing a form of uniform contraction of the time-inhomogeneous update operators.…”
Section: Introduction (mentioning; confidence: 99%)
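
The generalized Moreau envelope construction referenced in the last two snippets can be sketched in its standard form (the particular norms and constants are the cited papers' choices and are not reproduced here). Given the norm ‖·‖_c in which the target operator is a contraction and a smooth norm ‖·‖_s, one sets

M_\mu^f(x) \;=\; \min_{u \in \mathbb{R}^d} \Big\{ f(u) + \tfrac{1}{2\mu}\,\|x - u\|_s^2 \Big\}, \qquad f(x) = \tfrac{1}{2}\|x\|_c^2 .

The envelope M_\mu^f is smooth and equivalent to \tfrac{1}{2}\|x\|_c^2 up to multiplicative constants, which is what lets it serve as a Lyapunov function for contractive stochastic approximation under general norms, and which the works quoted above extend to time-inhomogeneous operators and Markov chains.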