2021
DOI: 10.48550/arxiv.2102.06234
Preprint

Optimization Issues in KL-Constrained Approximate Policy Iteration

Abstract: Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API). While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy. Popular practical algorithms such as TRPO, MPO, and VMPO replace the regularization with a constraint on the KL-divergence between consecutive policies, arguing that this is easier to implement and tune. In this work, we study this implementation choice in…
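
As a rough sketch (standard formulations assumed here, not equations quoted from the paper), the two kinds of update contrasted in the abstract can be written as follows, where \pi_k is the current policy, d^{\pi_k} the state distribution it induces, A^{\pi_k} its advantage function, \lambda a regularization weight, and \epsilon a trust-region radius; the direction of the KL term varies between algorithms:

KL-regularized update:
\pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s,a) \big] \;-\; \lambda \, \mathbb{E}_{s \sim d^{\pi_k}}\big[ \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big) \big]

KL-constrained (TRPO-style) update:
\pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s,a) \big] \quad \text{subject to} \quad \mathbb{E}_{s \sim d^{\pi_k}}\big[ \mathrm{KL}\big( \pi_k(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \big) \big] \le \epsilon

Per the abstract, TRPO, MPO, and VMPO adopt the constrained form because it is easier to implement and tune, and the paper studies the consequences of that implementation choice.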

Cited by 2 publications (2 citation statements) | References 8 publications

“…Here we consider simply best selection, and it remains open to understand other practical training tricks, such as proximal update [16,26] and regularization [27] under stochastic settings.…”
Section: Ensemble Methods (mentioning)
Confidence: 99%

“…If the limit of the variance is small enough, then Theorem 3.2 implies that the trust region method converges. In [22], a linear expected regret is proved for TRPO. Actually, this is consistent with our conclusions.…”
Section: Parameterized Policy (mentioning)
Confidence: 99%