2017
DOI: 10.48550/arxiv.1705.07798
Preprint

A unified view of entropy-regularized Markov decision processes

Abstract: We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to for…
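For orientation, here is a compact sketch of the construction the abstract describes, written in standard average-reward MDP notation (the symbols μ, Ω, η, V, ρ are ours, not taken from the excerpt, and signs or constants may differ from the paper). The policy-optimization LP is taken over stationary state-action distributions μ, the convex regularizer Ω is the negative conditional entropy of actions given states, and the resulting dual replaces the max of the Bellman optimality equation with a log-sum-exp:

\max_{\mu \ge 0}\;\; \sum_{s,a} \mu(s,a)\, r(s,a) \;-\; \frac{1}{\eta}\,\Omega(\mu)
\quad\text{s.t.}\quad
\sum_{a} \mu(s',a) = \sum_{s,a} P(s' \mid s,a)\,\mu(s,a)\;\;\forall s',
\qquad \sum_{s,a} \mu(s,a) = 1,

\Omega(\mu) \;=\; \sum_{s,a} \mu(s,a)\,\log\frac{\mu(s,a)}{\sum_{a'} \mu(s,a')},

\rho + V(s) \;=\; \frac{1}{\eta}\,\log \sum_{a} \exp\!\Big(\eta\,\big(r(s,a) + \sum_{s'} P(s' \mid s,a)\,V(s')\big)\Big).

The last equation is the "Bellman-like" dual the abstract refers to: as η grows, the log-sum-exp tends to the maximum over actions, recovering the unregularized average-reward Bellman optimality equation.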

Cited by 73 publications (122 citation statements)
References 14 publications
Citing publications: 2018–2022

Citation statements (ordered by relevance):
“…This similarity in addressing the inherent geometry of the problem is noticed by a line of recent work including Neu et al (2017); Geist et al (2019); Tomar et al (2020); Lan (2021), and the analysis techniques in MD methods have been adapted to the PG setting. The connection was first built explicitly in Neu et al (2017).…”
Section: Background and Related Work (mentioning)
confidence: 86%
“…is a mirror descent process which guarantees the convergence. With this property, similar to Neu et al (2017) and Wang et al (2019), we can use the following iterative process,…”
Section: Target Policy (mentioning)
confidence: 95%
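The "iterative process" referred to in this snippet is not reproduced on this page. As a minimal, hypothetical illustration of the kind of mirror-descent policy update this line of work builds on (tabular setting; the Q-values, step size eta, and array shapes below are assumptions made for the example, not taken from any of the cited papers):

import numpy as np

def mirror_descent_policy_update(policy, Q, eta):
    # One KL-regularized (mirror descent) step:
    # pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta * Q(s, a)).
    logits = np.log(policy) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# Toy usage: 2 states, 3 actions, hypothetical action values.
pi0 = np.full((2, 3), 1.0 / 3.0)                  # start from the uniform policy
Q = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5,  0.0]])
pi1 = mirror_descent_policy_update(pi0, Q, eta=1.0)
print(pi1)  # each row still sums to 1; probability mass shifts toward higher-Q actions

Iterating updates of this form with refreshed Q-estimates is, roughly, the pattern behind the convergence arguments the snippet mentions.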
“…The former is usually a deterministic policy (Sutton and Barto, 2018) which is not flexible enough for unknown situations, while the latter is a policy with non-zero probability for all actions which may be dangerous in some scenarios. Neu et al (2017) analyzed the entropy regularization method from several views. They revealed a more general form of regularization which is actually divergence regularization and showed entropy regularization is just a special case of divergence regularization.…”
Section: Related Work (mentioning)
confidence: 99%
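The snippet's observation that entropy regularization is a special case of divergence regularization can be made concrete with a one-line identity (our notation; u denotes the uniform policy over the action set):

D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\big\|\,u\big)
= \sum_{a} \pi(a \mid s)\,\log\frac{\pi(a \mid s)}{1/|\mathcal{A}|}
= -\,\mathcal{H}\big(\pi(\cdot \mid s)\big) + \log|\mathcal{A}|.

So penalizing the KL divergence to the uniform policy and rewarding policy entropy differ only by the constant log|A|; choosing a reference policy other than uniform gives the more general divergence regularization the snippet describes.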
“…The analyses therein heavily exploit the contraction properties of the Bellman optimality condition, making their extensions to the stochastic setting, with only stochastic first-order information, unclear without additional assumptions [9]. Connections between PG methods and the classical mirror descent algorithm in optimization [2,21,22] have also been established and exploited to establish convergence of the former methods (e.g., TRPO [27,28], REPS [24,25]). Until recently, [16] proposes policy mirror descent methods and its stochastic variants for general convex regularizers, and establishes linear convergence in both deterministic and stochastic setting.…”
mentioning
confidence: 99%
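For reference, the policy mirror descent update discussed in this last snippet has, schematically, the following per-state form (our notation, not that of the cited works; sign conventions and the placement of the step size vary across papers):

\pi_{k+1}(\cdot \mid s) \;\in\; \arg\max_{p \,\in\, \Delta(\mathcal{A})}
\Big\{ \big\langle Q^{\pi_k}(s, \cdot),\, p \big\rangle \;-\; h(p) \;-\; \tfrac{1}{\eta_k}\, D\big(p,\ \pi_k(\cdot \mid s)\big) \Big\},

where h is a convex regularizer (zero in the unregularized case, negative entropy for entropy regularization), D is a Bregman divergence, and η_k is the step size. Taking D to be the KL divergence yields the exponentiated, softmax-style update underlying the mirror-descent view of TRPO and REPS mentioned in the snippet.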