2019
DOI: 10.48550/arxiv.1906.05274
Preprint

Efficient Exploration via State Marginal Matching

Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, et al.

Abstract: To solve tasks with sparse rewards, reinforcement learning algorithms must be equipped with suitable exploration techniques. However, it is unclear what underlying objective is being optimized by existing exploration algorithms, or how they can be altered to incorporate prior knowledge about the task. Most importantly, it is difficult to use exploration experience from one task to acquire exploration strategies for another task. We address these shortcomings by learning a single exploration policy that can qui…
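As a hedged illustration of the state marginal matching objective named in the title, the sketch below computes an intrinsic reward of the form log p*(s) − log ρ_π(s) over a discretized state space. It is a minimal sketch under assumed ingredients (the uniform target distribution p_star, the state discretization, and the hypothetical rollout are all illustrative), not the authors' implementation.

```python
import numpy as np

def smm_intrinsic_rewards(visited_states, p_star, n_states, smoothing=1e-3):
    """Sketch of a state-marginal-matching style intrinsic reward.

    visited_states: discrete state indices collected by the current policy.
    p_star: target state distribution over n_states discrete states.
    Returns log p*(s) - log rho_pi(s) for each visited state, where rho_pi is an
    empirical (smoothed) estimate of the policy's state marginal.
    """
    counts = np.bincount(visited_states, minlength=n_states).astype(float)
    rho_pi = (counts + smoothing) / (counts.sum() + smoothing * n_states)
    return np.log(p_star[visited_states]) - np.log(rho_pi[visited_states])

# Hypothetical usage with a uniform target distribution over 10 discrete states.
n_states = 10
p_star = np.full(n_states, 1.0 / n_states)
rollout = np.array([0, 0, 1, 2, 2, 2, 3])  # states visited by the current policy
rewards = smm_intrinsic_rewards(rollout, p_star, n_states)
```

Rewards are largest in states the policy under-visits relative to the target, which is what pushes the policy toward broader coverage under this kind of objective.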


Cited by 53 publications (118 citation statements).
References 21 publications.
“…It is important to note, however, that Theorem 2 and Corollary 1 are of independent interest. Recent works [18,48,26] consider the problem of maximizing entropy of occupancy measures, but none of them provide a simple gradient expression that can be used to directly perform gradient ascent in θ on the entropies (4), (5). Our entropy gradient theorem and corresponding corollary resolve this issue, and entropy maximization algorithms using it to maximize (4), (5) are provided in the appendix.…”
Section: Entropy and OIR Policy Gradient Theorems
Citation type: mentioning
confidence: 99%
“…This raises a crucial question: how should the informativeness of policies be quantified? Occupancy measure entropy has recently been used as an optimization objective that quantifies the expected amount of exploration of the state (or state-action) space that a policy performs [18,26,48]. In other words, occupancy measure entropy quantifies the amount of information about the environment that a policy provides, on average, by measuring how uniformly the policy covers the state (or state-action) space.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
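To make the occupancy-measure entropy objective in the statement above concrete, here is a minimal plug-in estimator over a discretized state space; the discretization, smoothing, and example rollouts are illustrative assumptions rather than the estimator used in the cited works.

```python
import numpy as np

def state_occupancy_entropy(visited_states, n_states, smoothing=1e-3):
    """Plug-in estimate of the entropy of a policy's state occupancy measure.

    visited_states: discrete state indices gathered from rollouts of the policy.
    Returns H(rho_pi) = -sum_s rho_pi(s) * log rho_pi(s) from empirical frequencies.
    """
    counts = np.bincount(visited_states, minlength=n_states).astype(float)
    rho_pi = (counts + smoothing) / (counts.sum() + smoothing * n_states)
    return -np.sum(rho_pi * np.log(rho_pi))

# A policy that covers the state space more uniformly scores higher entropy.
print(state_occupancy_entropy(np.array([0, 1, 2, 3, 4, 5]), n_states=6))  # near log(6)
print(state_occupancy_entropy(np.array([0, 0, 0, 0, 0, 5]), n_states=6))  # much lower
```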
“…Distributional Reinforcement Learning and State-Marginal Matching: Modeling the full distribution of returns instead of the averages led to the development of distributional RL algorithms (Bellemare et al., 2017; Castro et al., 2018; Barth-Maron et al., 2018) such as Categorical Q-learning (Bellemare et al., 2017). While our work shares techniques such as discretization and binning, these works focus on optimizing a non-conditional reward-maximizing RL policy, and therefore our problem definition is closer to that of state-marginal matching algorithms (Hazan et al., 2019; Lee et al., 2020; Ghasemipour et al., 2020), or equivalently inverse RL algorithms (Ziebart et al., 2008; Ho & Ermon, 2016; Finn et al., 2016; Fu et al., 2018; Ghasemipour et al., 2020), whose connections to feature-expectation matching have long been discussed (Abbeel & Ng, 2004). However, those are often exclusively online algorithms, even in their sample-efficient variants (Kostrikov et al., 2019), since density-ratio estimation with either a discriminative (Ghasemipour et al., 2020) or a generative (Lee et al., 2020) approach requires on-policy samples, with the rare exception of Kostrikov et al. (2020).…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…While our work shares techniques such as discretization and binning, these works focus on optimizing a non-conditional reward-maximizing RL policy, and therefore our problem definition is closer to that of state-marginal matching algorithms (Hazan et al., 2019; Lee et al., 2020; Ghasemipour et al., 2020), or equivalently inverse RL algorithms (Ziebart et al., 2008; Ho & Ermon, 2016; Finn et al., 2016; Fu et al., 2018; Ghasemipour et al., 2020), whose connections to feature-expectation matching have long been discussed (Abbeel & Ng, 2004). However, those are often exclusively online algorithms, even in their sample-efficient variants (Kostrikov et al., 2019), since density-ratio estimation with either a discriminative (Ghasemipour et al., 2020) or a generative (Lee et al., 2020) approach requires on-policy samples, with the rare exception of Kostrikov et al. (2020). Building on the success of DT and brute-force hindsight imitation learning, our Categorical DT is, to the best of our knowledge, the first method that benchmarks the offline state-marginal matching problem in the multi-task setting.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
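The density-ratio remark in the statement above can be made concrete with the standard discriminative trick: a binary classifier trained to separate target-distribution states from freshly collected on-policy states yields an estimate of log p*(s) − log ρ_π(s) from its log-odds. The sketch below is a minimal illustration of that trick (logistic regression on 1-D states with equal sample sizes), not the discriminator used by the cited methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def log_density_ratio(target_states, policy_states):
    """Discriminative estimate of log p_target(s) - log rho_pi(s).

    Label 1 = samples from the target distribution, label 0 = on-policy samples.
    With equal sample sizes, the classifier's log-odds approximate the log ratio.
    """
    X = np.concatenate([target_states, policy_states])
    y = np.concatenate([np.ones(len(target_states)), np.zeros(len(policy_states))])
    clf = LogisticRegression().fit(X, y)
    return lambda states: clf.decision_function(states)  # log-odds ~ log density ratio

# Hypothetical 1-D example: the target concentrates near 2.0, the policy near 0.0.
rng = np.random.default_rng(0)
target = rng.normal(2.0, 1.0, size=(500, 1))
policy = rng.normal(0.0, 1.0, size=(500, 1))
ratio_fn = log_density_ratio(target, policy)
intrinsic_reward = ratio_fn(policy)  # larger where the policy under-visits the target region
```

Because ρ_π changes whenever the policy changes, the classifier must be refit on fresh on-policy samples, which is the on-policy requirement the quoted passage points out.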