Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning

Furuta, Hiroki; Matsushima, Tatsuya; Kozuno, Tadashi; Matsuo, Yutaka; Levine, Sergey; Nachum, Ofir; Gu, Shixiang

doi:10.48550/arxiv.2103.12726

Cited by 1 publication

(1 citation statement)

References 0 publications

“…In particular, certain forms of UCB-induced exploration lead to regret bounds that are comparable to model-based methods when used in conjunction with the model-free Q-learning algorithm in the tabular case [19]. In a distinct, but related direction, the recent work [14] proposes a novel, information-theoretic measure of task complexity, called policy(-optimal) information capacity, and empirically demonstrates how it can be used to determine the difficulty of a range of common RL problems.…”

Section: Related Work (mentioning)

confidence: 99%

Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

Suttle, Koppel, Liu

2022

Preprint

We develop a new measure of the exploration/exploitation trade-off in infinite-horizon reinforcement learning problems called the occupancy information ratio (OIR), defined as the ratio between the infinite-horizon average cost of a policy and the entropy of its long-term state occupancy measure. The OIR ensures that no matter how many trajectories an RL agent traverses or how well it learns to minimize cost, it maintains a healthy skepticism about its environment, in that it defines an optimal policy which induces a high-entropy occupancy measure. Unlike earlier information ratio notions, OIR is amenable to direct policy search over parameterized families, and exhibits hidden quasiconcavity through invocation of the perspective transformation. This feature ensures that under appropriate policy parameterizations, the OIR optimization problem has no spurious stationary points, despite the overall problem's nonconvexity. We develop, for the first time, policy gradient and actor-critic algorithms for OIR optimization based upon a new entropy gradient theorem, and establish both asymptotic and nonasymptotic convergence results with global optimality guarantees. In experiments, these methodologies outperform several deep RL baselines in problems with sparse rewards, where many trajectories may be uninformative and skepticism about the environment is crucial to success.
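
The ratio described in the abstract can be illustrated on a small tabular example. The sketch below is not the authors' algorithm: it only evaluates average cost divided by occupancy entropy for a fixed policy, under the assumptions that the policy induces an ergodic, aperiodic Markov chain with a known row-stochastic transition matrix `P` and a per-state cost vector `cost`. The function names, the power-iteration solver, and the use of natural logarithms are illustrative choices, not taken from the paper.

```python
import math


def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    Uses plain power iteration from the uniform distribution; this assumes
    the chain is ergodic and aperiodic so the iteration converges.
    """
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        # One step of d <- d P (left multiplication by the row vector d).
        nxt = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, d)) < tol:
            return nxt
        d = nxt
    return d


def entropy(d):
    """Shannon entropy (in nats) of a discrete distribution d."""
    return -sum(p * math.log(p) for p in d if p > 0.0)


def occupancy_information_ratio(P, cost):
    """Average cost under the stationary distribution divided by its entropy.

    Hypothetical helper: P is the chain induced by a fixed policy and
    cost[s] is the expected one-step cost in state s under that policy.
    """
    d = stationary_distribution(P)
    avg_cost = sum(p * c for p, c in zip(d, cost))
    h = entropy(d)
    if h <= 0.0:
        # A point-mass occupancy measure has zero entropy; the ratio is undefined.
        raise ValueError("occupancy entropy is zero; ratio is undefined")
    return avg_cost / h


if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    cost = [1.0, 2.0]
    print(occupancy_information_ratio(P, cost))
```

With cost in the numerator and entropy in the denominator, a policy lowers this ratio either by reducing its average cost or by spreading its occupancy over more states, which is the trade-off the abstract describes.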
