2021
DOI: 10.48550/arxiv.2107.08285
Preprint

Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences

Abstract: Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate …
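For context, the two greedification objectives named in the abstract can be written explicitly: reducing either the reverse or the forward KL divergence between the parameterized policy π_θ and the Boltzmann distribution over action values. The temperature symbol τ below is a conventional choice and may not match the paper's own notation.

```latex
% Boltzmann distribution over the action values Q(s, a), with temperature \tau:
\mathcal{B}_Q(a \mid s) \;=\; \frac{\exp\!\big(Q(s,a)/\tau\big)}{\sum_{a'} \exp\!\big(Q(s,a')/\tau\big)}

% Reverse-KL and forward-KL greedification objectives (minimized over \theta):
\mathrm{KL}\!\big(\pi_\theta(\cdot \mid s)\,\big\|\,\mathcal{B}_Q(\cdot \mid s)\big)
  \;=\; \sum_{a} \pi_\theta(a \mid s)\,\log \frac{\pi_\theta(a \mid s)}{\mathcal{B}_Q(a \mid s)},
\qquad
\mathrm{KL}\!\big(\mathcal{B}_Q(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)
  \;=\; \sum_{a} \mathcal{B}_Q(a \mid s)\,\log \frac{\mathcal{B}_Q(a \mid s)}{\pi_\theta(a \mid s)}.
```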

Cited by 3 publications (5 citation statements)
References 34 publications

“…Their results provide a geometric perspective to help understand the dynamics of different RL algorithms (Kumar et al, 2019; Chan et al, 2020; Harb et al, 2020; Chan et al, 2021), and also inspire new methods in representation learning in RL (Bellemare et al, 2019; Dabney et al, 2021).…”
Section: Related Work
confidence: 96%
“…The discount factor γ is introduced to reduce variance and improve the convergence of soft-Q iteration [13,8]. For stability, we replace the bootstrap term Q_⊥(φ) with a φ-averaging target network Q_ψ using the Polyak method [43].…”
Section: Learning Critic Models With Soft Q-learning
confidence: 99%
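The Polyak-averaged ("soft") target-network update mentioned in this excerpt can be sketched as follows; the coefficient value and the dictionary-of-arrays parameter representation are illustrative assumptions, not the cited paper's code.

```python
import numpy as np

def polyak_update(target_params, online_params, rho=0.995):
    """Soft (Polyak) target update: target <- rho * target + (1 - rho) * online.

    Bootstrapping against a slowly moving copy of the critic parameters is the
    stabilization device the excerpt refers to; rho here is an assumed value.
    """
    return {name: rho * target_params[name] + (1.0 - rho) * online_params[name]
            for name in target_params}

# Hypothetical parameter dictionaries for the online critic (phi) and its target (psi).
phi = {"w": np.ones(4), "b": np.full(4, 0.5)}
psi = {"w": np.zeros(4), "b": np.zeros(4)}
psi = polyak_update(psi, phi)   # psi moves slightly toward phi
print(psi["w"])                 # -> [0.005 0.005 0.005 0.005]
```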
“…Their derivation is based on the two-filter formula [10,36], but the method is equivalent to the one that would be obtained using the equations from Figure 1b with the heuristic factors set to be the value function h_t = V_φ(s_{t+1}). This formulation cannot accommodate putative action particles and uses a parametric policy to learn the value function instead of the soft Bellman update [5,13].…”
Section: Related Work
confidence: 99%
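For reference, the soft Bellman update that this excerpt contrasts against is the standard soft Q-learning backup; it is written below with an entropy temperature τ, a common convention that may differ from the cited works' notation.

```latex
% Soft Bellman update (soft Q-learning), with entropy temperature \tau:
V_{\mathrm{soft}}(s) \;=\; \tau \,\log \sum_{a} \exp\!\big(Q_{\mathrm{soft}}(s,a)/\tau\big),
\qquad
Q_{\mathrm{soft}}(s,a) \;=\; r(s,a) \;+\; \gamma\,
   \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[V_{\mathrm{soft}}(s')\big].
```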
“…This causes q to choose any mode when p is multimodal, resulting in a concentration of probability density on that mode and ignoring the other high-density modes. Consequently, optimising RKL leads to a suboptimal solution, causing q to have limited support, leading to mode collapse [34-36]. In a sense, mode collapse and mode seeking are equivalent because mode-seeking leads to focusing on a few modes only, which causes mode collapse.…”
Section: Introduction
confidence: 99%
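A small numerical sketch of the mode-seeking behaviour described above: fitting a unit-variance Gaussian q to a bimodal target p under reverse KL favours sitting on a single mode over covering both. The grid, means, and variances below are arbitrary illustrative choices.

```python
import numpy as np

# Bimodal target p: an equal mixture of two well-separated unit-variance Gaussians.
xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p = 0.5 * gauss(xs, -4.0) + 0.5 * gauss(xs, 4.0)
p /= np.sum(p) * dx                    # renormalize on the grid

def reverse_kl(mu):
    """KL(q || p) for a unit-variance Gaussian q centred at mu, via a grid sum."""
    q = gauss(xs, mu)
    q /= np.sum(q) * dx
    return float(np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx)

# Reverse KL is small when q sits on either mode and large in between:
# the mode-seeking / mode-collapse behaviour the excerpt describes.
for mu in (-4.0, 0.0, 4.0):
    print(f"mu = {mu:+.1f}   KL(q || p) = {reverse_kl(mu):.3f}")
```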