A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

Chen, Zaiwei; Maguluri, Siva Theja; Shakkottai, Sanjay; Shanmugam, Karthikeyan

doi:10.48550/arxiv.2102.01567

Cited by 21 publications

(105 citation statements)

References 41 publications


“…Central limit theorems [For15] and non-asymptotic convergence rates [KMMW19] have been established for controlled Markov processes. In addition to the papers discussed in Section 1, several recent works have considered particular aspects of SA with Markov data, including two-timescale variants [DNPR20, KB18], observation skipping schemes for bias reduction [KLL20], Lyapunov function-based analysis under general norms [CMSS21], and proving guarantees under weaker ergodicity conditions [DDA21].…”

Section: Stochastic Approximation Methods (mentioning)

confidence: 99%

“…There is a long line of past work on this algorithm, including convergence guarantees [Tsi94, Sze98, EDM03], results on linear function approximation for optimal stopping problems [TVR99, BRS18], and non-asymptotic rates under general norms in both the i.i.d. setting [Wai19a, Bor21] as well as the Markovian setting [CMSS21]. A class of variants of TD and Q-learning are also studied in literature, including actor-critic methods [KT00], SARSA [RN94], and methods that employ variance-reduction [SWW+18, KPR+21, Wai19b, KXWJ21].…”

Section: Application to RL Problems (mentioning)

confidence: 99%


Optimal and instance-dependent guarantees for Markovian linear stochastic approximation

Mou, Pananjady, Wainwright

2021

Preprint


We study stochastic approximation procedures for approximately solving a d-dimensional linear fixed point equation based on observing a trajectory of length n from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \frac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise, covering the TD(λ) family of algorithms for all λ ∈ [0, 1), and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of λ when running the TD(λ) algorithm).
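To make the setting in this abstract concrete, here is a minimal Python sketch of stochastic approximation driven by a single Markov trajectory, reporting both the last iterate and an averaged iterate. It is an illustration only and not the authors' scheme: it assumes tabular TD(0) (one-hot features, the simplest linear case), a constant step size, plain averaging of the iterates after a burn-in period, and a hypothetical 3-state chain `P` with rewards `r`. The function name `td0_with_averaging` and all parameter values are likewise assumptions made for the example.

```python
import numpy as np

def td0_with_averaging(P, r, gamma, n_steps, step_size, burn_in, seed=0):
    """Tabular TD(0) along one Markov trajectory (illustrative sketch).

    Returns the last iterate and the average of the iterates produced
    after the first `burn_in` steps.
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    theta = np.zeros(n_states)       # current value estimate
    theta_sum = np.zeros(n_states)   # running sum for the averaged iterate
    s = 0
    for t in range(n_steps):
        # Sample the next state from the chain: the data are Markovian, not i.i.d.
        s_next = rng.choice(n_states, p=P[s])
        td_error = r[s] + gamma * theta[s_next] - theta[s]
        theta[s] += step_size * td_error   # only the visited state is updated
        if t >= burn_in:
            theta_sum += theta
        s = s_next
    return theta, theta_sum / (n_steps - burn_in)

# Hypothetical 3-state ergodic chain and reward vector (assumed for illustration).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
r = np.array([1.0, 0.0, -1.0])
gamma = 0.5

# The target is the solution of the linear fixed-point equation v = r + gamma * P v.
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)

last, avg = td0_with_averaging(P, r, gamma, n_steps=30000,
                               step_size=0.02, burn_in=5000)
print("fixed point      :", v_true)
print("last iterate     :", last)
print("averaged iterate :", avg)
```

With a constant step size the last iterate keeps fluctuating around the fixed point, while the averaged iterate smooths out much of that noise; this is the qualitative distinction the abstract draws between its last-iterate and averaged-iterate bounds.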


“…In recent work, a subset of the current authors (Kotsalis et al, 2020b) provided an improved analysis of vanilla TD algorithm that can benefit from parallel computing. Other notable analyses of TD learning and some of its variants include those by Srikant and Ying (2019); Chen et al (2021); Durmus et al (2021). While some of these analyses are sharp, it is well-known that vanilla TD learning does not attain optimal oracle and sample complexities.…”

Section: Related Work (mentioning)

confidence: 99%

Accelerated and instance-optimal policy evaluation with linear function approximation

Li, Lan, Pananjady

2021

Preprint


We study the problem of policy evaluation with linear function approximation and present efficient and practical algorithms that come with strong optimality guarantees. We begin by proving lower bounds that establish baselines on both the deterministic error and stochastic error in this problem. In particular, we prove an oracle complexity lower bound on the deterministic error in an instance-dependent norm associated with the stationary distribution of the transition kernel, and use the local asymptotic minimax machinery to prove an instance-dependent lower bound on the stochastic error in the i.i.d. observation model. Existing algorithms fail to match at least one of these lower bounds: to illustrate, we analyze a variance-reduced variant of temporal difference learning, showing in particular that it fails to achieve the oracle complexity lower bound. To remedy this issue, we develop an accelerated, variance-reduced fast temporal difference algorithm (VRFTD) that simultaneously matches both lower bounds and attains a strong notion of instance-optimality. Finally, we extend the VRFTD algorithm to the setting with Markovian observations, and provide instance-dependent convergence results that match those in the i.i.d. setting up to a multiplicative factor that is proportional to the mixing time of the chain. Our theoretical guarantees of optimality are corroborated by numerical experiments.


“…One important ingredient of our convergence results is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains. Similar to Chen et al (2021b), we rely on the use of the generalized Moreau envelope to form a Lyapunov function. Our results, however, extend those of Chen et al (2021b) from time-homogeneous to time-inhomogeneous Markov chains and from time-homogeneous to time-inhomogeneous update operators.…”

Section: Introduction (mentioning)

confidence: 99%

“…Similar to Chen et al (2021b), we rely on the use of the generalized Moreau envelope to form a Lyapunov function. Our results, however, extend those of Chen et al (2021b) from time-homogeneous to time-inhomogeneous Markov chains and from time-homogeneous to time-inhomogeneous update operators. Those extensions make our results immediately applicable to the off-policy actor-critic settings and are made possible by establishing a form of uniform contraction of the time-inhomogeneous update operators.…”

Section: Introduction (mentioning)

confidence: 99%

Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch

Zhang, Combes, Laroche

2021

Preprint


In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing works on the optimality of policy gradient methods in that existing works use the exact policy gradient for updating the policy parameters while we use an approximate and stochastic update step. Our update step is not a gradient update because we do not use a density ratio to correct the state distribution, which aligns well with what practitioners do. Our update is approximate because we use a learned critic instead of the true value function. Our update is stochastic because at each step the update is done for only the current state action pair. Moreover, we remove several restrictive assumptions from existing works in our analysis. Central to our work is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains, based on its uniform contraction properties.
