Leverage the Average: an Analysis of KL Regularization in RL
2020 · Preprint
DOI: 10.48550/arxiv.2003.14089

Abstract: Building upon the formalism of regularized Markov decision processes, we study the effect of Kullback-Leibler (KL) and entropy regularization in reinforcement learning. Through an equivalent formulation of the related approximate dynamic programming (ADP) scheme, we show that a KL penalty amounts to averaging q-values. This equivalence allows drawing connections between a priori disconnected methods from the literature, and proving that a KL regularization indeed leads to averaging errors made at each iteration…
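As a rough illustration of the equivalence the abstract describes, the sketch below unrolls a KL-regularized greedy step; the notation (policy π_k, q-estimate q_k, penalty weight λ) is assumed here and does not come from the excerpt above.

```latex
% Minimal sketch of the KL-regularization-as-averaging argument, under assumed
% notation: \pi_k is the iteration-k policy, q_k its q-value estimate, and
% \lambda > 0 the KL penalty weight.
\begin{align*}
  % KL-regularized greedy step: improve against q_k while staying close to \pi_k
  \pi_{k+1}(\cdot \mid s)
    &= \arg\max_{\pi(\cdot \mid s)} \;
       \langle \pi(\cdot \mid s), q_k(s, \cdot) \rangle
       - \lambda \, \mathrm{KL}\bigl(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\bigr) \\
  % Closed form: a multiplicative, mirror-descent-like update
    &\propto \pi_k(\cdot \mid s)\, \exp\bigl(q_k(s, \cdot)/\lambda\bigr) \\
  % Unrolling from a uniform \pi_0: a softmax of the sum of all past q-estimates,
  % so the KL penalty implicitly averages q-values (and hence their errors).
    &\propto \exp\Bigl(\tfrac{1}{\lambda}\textstyle\sum_{j=0}^{k} q_j(s, \cdot)\Bigr).
\end{align*}
```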

Cited by 8 publications (25 citation statements). References 6 publications.

“…More recently, several works have proposed regularizing policy updates by KL-divergence to the previous policy (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Vieillard et al., 2020a; Tomar et al., 2020). As a concrete instantiation, the POLITEX algorithm (Abbasi-Yadkori et al., 2019) updates policies as…”
Section: Regularized Policy Updates
confidence: 99%
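The quoted sentence truncates before the update itself; as a hedged sketch, the snippet below implements the softmax-over-summed-q-values rule commonly associated with POLITEX and with unrolled KL-regularized updates. The function name, step size eta, and toy problem sizes are illustrative, not from the cited papers.

```python
import numpy as np

def politex_style_policy(q_history, eta=1.0):
    """Hypothetical helper (not from the cited papers): softmax policy over the
    sum of all past q-value estimates, the form usually associated with POLITEX
    and with unrolled KL-regularized policy updates.

    q_history: list of arrays of shape (n_states, n_actions), one per iteration.
    eta: step size (inverse of the KL penalty weight).
    """
    q_sum = eta * np.sum(q_history, axis=0)             # (n_states, n_actions)
    q_sum -= q_sum.max(axis=1, keepdims=True)           # log-sum-exp stabilization
    policy = np.exp(q_sum)
    return policy / policy.sum(axis=1, keepdims=True)   # each row is a distribution

# Toy usage: three iterations of random q-estimates on a 4-state, 2-action problem.
rng = np.random.default_rng(0)
q_history = [rng.normal(size=(4, 2)) for _ in range(3)]
print(politex_style_policy(q_history, eta=0.5))
```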
“…Unfortunately, if advantage functions are approximated by neural networks, the above update requires us to store the parameters of all past networks in memory, which is impractical. Possible heuristics for ensuring memory efficiency include subsampling action-value networks (Abbasi-Yadkori et al., 2019) and/or distillation to approximate the sum by a single network (Vieillard et al., 2020a).…”
Section: Regularized Policy Updates
confidence: 99%
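A minimal sketch of the distillation heuristic mentioned in this quote, assuming a PyTorch setup that is not part of the cited works: a single "cumulative" network is regressed onto the sum of its frozen previous version and the newest q-network, so the policy never has to keep all past networks in memory.

```python
import copy
import torch
import torch.nn as nn

# Hedged, illustrative code (not from the cited papers): keep one cumulative
# q-network and regress it onto (frozen previous cumulative network + newest
# q-network), so only a single set of parameters is ever stored.

def make_q_net(obs_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def distill_step(cumulative_net, new_q_net, states, n_updates=200, lr=1e-3):
    target_net = copy.deepcopy(cumulative_net)            # frozen copy of the old sum
    optimizer = torch.optim.Adam(cumulative_net.parameters(), lr=lr)
    with torch.no_grad():                                 # regression target: old sum + new q
        target = target_net(states) + new_q_net(states)
    for _ in range(n_updates):
        loss = nn.functional.mse_loss(cumulative_net(states), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return cumulative_net

# Toy usage on random states; in practice `states` would come from a replay buffer.
states = torch.randn(256, 4)
cumulative_net, new_q_net = make_q_net(), make_q_net()
cumulative_net = distill_step(cumulative_net, new_q_net, states)
```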
“…In a somewhat related vein, a number of works use REPS-inspired derivations to yield dynamic programming algorithms (Fox et al., 2017; Geist et al., 2019; Vieillard et al., 2020) and subsequently provide guarantees on the convergence of approximate dynamic programming in these settings. Our results focus on the use of REPS in a convex programming context, and on optimizing these programs via standard gradient-based solvers.…”
Section: Related Work
confidence: 99%
“…still serves as a γ-contraction [32, 18]. When using the regularized Bellman operator, dynamic programming methods could still achieve a linear convergence rate [18, 29, 25]. Smirnova and Dohmatob [25] analyzed the convergence of a general form of regularized policy iteration when λ_t decays, in an asymptotic sense.…”
Section: Related Work
confidence: 99%
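The contraction claim in this quote is easy to sanity-check numerically. The toy script below (random MDP, entropy-regularized "soft" Bellman operator; all parameters illustrative and not from the cited works) verifies that successive value-iteration gaps shrink by at least a factor of γ.

```python
import numpy as np

# Toy numerical check (assumed setup): the entropy-regularized ("soft") Bellman operator
#     (T_lam v)(s) = lam * log( sum_a exp( (r(s,a) + gamma * E_{s'|s,a}[v(s')]) / lam ) )
# remains a gamma-contraction in sup-norm, so iterating it converges linearly.

rng = np.random.default_rng(1)
n_states, n_actions, gamma, lam = 5, 3, 0.9, 0.1
r = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = dist. over s'

def soft_bellman(v):
    q = r + gamma * P @ v                        # (n_states, n_actions)
    m = q.max(axis=1, keepdims=True)             # stabilized log-sum-exp over actions
    return (m + lam * np.log(np.exp((q - m) / lam).sum(axis=1, keepdims=True))).ravel()

v, prev_gap = np.zeros(n_states), None
for k in range(10):
    v_next = soft_bellman(v)
    gap = np.abs(v_next - v).max()
    if prev_gap is not None:
        # Each successive sup-norm gap shrinks by at least a factor of gamma.
        print(f"iter {k}: gap ratio = {gap / prev_gap:.3f} (gamma = {gamma})")
    v, prev_gap = v_next, gap
```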