Leverage the Average: an Analysis of KL Regularization in RL
2020 · Preprint
DOI: 10.48550/arxiv.2003.14089

Abstract: Building upon the formalism of regularized Markov decision processes, we study the effect of Kullback-Leibler (KL) and entropy regularization in reinforcement learning. Through an equivalent formulation of the related approximate dynamic programming (ADP) scheme, we show that a KL penalty amounts to averaging q-values. This equivalence allows drawing connections between a priori disconnected methods from the literature, and proving that a KL regularization indeed leads to averaging errors made at each iteration…
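As a rough illustration of the equivalence the abstract describes, the sketch below unrolls a KL-regularized greedy step; the notation (policy π_k, q-estimate q_k, penalty weight λ) is assumed here and does not come from the excerpt above.

```latex
% Minimal sketch of the KL-regularization-as-averaging argument, under assumed
% notation: \pi_k is the iteration-k policy, q_k its q-value estimate, and
% \lambda > 0 the KL penalty weight.
\begin{align*}
  % KL-regularized greedy step: improve against q_k while staying close to \pi_k
  \pi_{k+1}(\cdot \mid s)
    &= \arg\max_{\pi(\cdot \mid s)} \;
       \langle \pi(\cdot \mid s), q_k(s, \cdot) \rangle
       - \lambda \, \mathrm{KL}\bigl(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\bigr) \\
  % Closed form: a multiplicative, mirror-descent-like update
    &\propto \pi_k(\cdot \mid s)\, \exp\bigl(q_k(s, \cdot)/\lambda\bigr) \\
  % Unrolling from a uniform \pi_0: a softmax of the sum of all past q-estimates,
  % so the KL penalty implicitly averages q-values (and hence their errors).
    &\propto \exp\Bigl(\tfrac{1}{\lambda}\textstyle\sum_{j=0}^{k} q_j(s, \cdot)\Bigr).
\end{align*}
```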

Cited by 8 publications (25 citation statements). References 6 publications.

“…More recently, several works have proposed regularizing policy updates by KL-divergence to the previous policy (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Vieillard et al., 2020a; Tomar et al., 2020). As a concrete instantiation, the POLITEX algorithm (Abbasi-Yadkori et al., 2019) updates policies as…”
Section: Regularized Policy Updates
confidence: 99%
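The quoted sentence truncates before the update itself; as a hedged sketch, the snippet below implements the softmax-over-summed-q-values rule commonly associated with POLITEX and with unrolled KL-regularized updates. The function name, step size eta, and toy problem sizes are illustrative, not from the cited papers.

```python
import numpy as np

def politex_style_policy(q_history, eta=1.0):
    """Hypothetical helper (not from the cited papers): softmax policy over the
    sum of all past q-value estimates, the form usually associated with POLITEX
    and with unrolled KL-regularized policy updates.

    q_history: list of arrays of shape (n_states, n_actions), one per iteration.
    eta: step size (inverse of the KL penalty weight).
    """
    q_sum = eta * np.sum(q_history, axis=0)             # (n_states, n_actions)
    q_sum -= q_sum.max(axis=1, keepdims=True)           # log-sum-exp stabilization
    policy = np.exp(q_sum)
    return policy / policy.sum(axis=1, keepdims=True)   # each row is a distribution

# Toy usage: three iterations of random q-estimates on a 4-state, 2-action problem.
rng = np.random.default_rng(0)
q_history = [rng.normal(size=(4, 2)) for _ in range(3)]
print(politex_style_policy(q_history, eta=0.5))
```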
“…Unfortunately, if advantage functions are approximated by neural networks, the above update requires us to store the parameters of all past networks in memory, which is impractical. Possible heuristics for ensuring memory efficiency include subsampling action-value networks (Abbasi-Yadkori et al., 2019) and/or distillation to approximate the sum by a single network (Vieillard et al., 2020a).…”
Section: Regularized Policy Updates
confidence: 99%
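A minimal sketch of the distillation heuristic mentioned in this quote, assuming a PyTorch setup that is not part of the cited works: a single "cumulative" network is regressed onto the sum of its frozen previous version and the newest q-network, so the policy never has to keep all past networks in memory.

```python
import copy
import torch
import torch.nn as nn

# Hedged, illustrative code (not from the cited papers): keep one cumulative
# q-network and regress it onto (frozen previous cumulative network + newest
# q-network), so only a single set of parameters is ever stored.

def make_q_net(obs_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def distill_step(cumulative_net, new_q_net, states, n_updates=200, lr=1e-3):
    target_net = copy.deepcopy(cumulative_net)            # frozen copy of the old sum
    optimizer = torch.optim.Adam(cumulative_net.parameters(), lr=lr)
    with torch.no_grad():                                 # regression target: old sum + new q
        target = target_net(states) + new_q_net(states)
    for _ in range(n_updates):
        loss = nn.functional.mse_loss(cumulative_net(states), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return cumulative_net

# Toy usage on random states; in practice `states` would come from a replay buffer.
states = torch.randn(256, 4)
cumulative_net, new_q_net = make_q_net(), make_q_net()
cumulative_net = distill_step(cumulative_net, new_q_net, states)
```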
“…In a somewhat related vein, a number of works use REPS-inspired derivations to yield dynamic programming algorithms (Fox et al., 2017; Geist et al., 2019; Vieillard et al., 2020) and subsequently provide guarantees on the convergence of approximate dynamic programming in these settings. Our results focus on the use of REPS in a convex programming context, and on optimizing these programs via standard gradient-based solvers.…”
Section: Related Work
confidence: 99%
“…still serves as a γ-contraction [32, 18]. When using the regularized Bellman operator, dynamic programming methods could still achieve a linear convergence rate [18, 29, 25]. Smirnova and Dohmatob [25] analyzed the convergence of a general form of regularized policy iteration when λ_t decays, in an asymptotic sense.…”
Section: Related Work
confidence: 99%
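The contraction claim in this quote is easy to sanity-check numerically. The toy script below (random MDP, entropy-regularized "soft" Bellman operator; all parameters illustrative and not from the cited works) verifies that successive value-iteration gaps shrink by at least a factor of γ.

```python
import numpy as np

# Toy numerical check (assumed setup): the entropy-regularized ("soft") Bellman operator
#     (T_lam v)(s) = lam * log( sum_a exp( (r(s,a) + gamma * E_{s'|s,a}[v(s')]) / lam ) )
# remains a gamma-contraction in sup-norm, so iterating it converges linearly.

rng = np.random.default_rng(1)
n_states, n_actions, gamma, lam = 5, 3, 0.9, 0.1
r = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = dist. over s'

def soft_bellman(v):
    q = r + gamma * P @ v                        # (n_states, n_actions)
    m = q.max(axis=1, keepdims=True)             # stabilized log-sum-exp over actions
    return (m + lam * np.log(np.exp((q - m) / lam).sum(axis=1, keepdims=True))).ravel()

v, prev_gap = np.zeros(n_states), None
for k in range(10):
    v_next = soft_bellman(v)
    gap = np.abs(v_next - v).max()
    if prev_gap is not None:
        # Each successive sup-norm gap shrinks by at least a factor of gamma.
        print(f"iter {k}: gap ratio = {gap / prev_gap:.3f} (gamma = {gamma})")
    v, prev_gap = v_next, gap
```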