2019
DOI: 10.48550/arxiv.1906.05274
Preprint

Efficient Exploration via State Marginal Matching

Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, et al.

Abstract: To solve tasks with sparse rewards, reinforcement learning algorithms must be equipped with suitable exploration techniques. However, it is unclear what underlying objective is being optimized by existing exploration algorithms, or how they can be altered to incorporate prior knowledge about the task. Most importantly, it is difficult to use exploration experience from one task to acquire exploration strategies for another task. We address these shortcomings by learning a single exploration policy that can qui…
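As a hedged illustration of the state marginal matching objective named in the title, the sketch below computes an intrinsic reward of the form log p*(s) − log ρ_π(s) over a discretized state space. It is a minimal sketch under assumed ingredients (the uniform target distribution p_star, the state discretization, and the hypothetical rollout are all illustrative), not the authors' implementation.

```python
import numpy as np

def smm_intrinsic_rewards(visited_states, p_star, n_states, smoothing=1e-3):
    """Sketch of a state-marginal-matching style intrinsic reward.

    visited_states: discrete state indices collected by the current policy.
    p_star: target state distribution over n_states discrete states.
    Returns log p*(s) - log rho_pi(s) for each visited state, where rho_pi is an
    empirical (smoothed) estimate of the policy's state marginal.
    """
    counts = np.bincount(visited_states, minlength=n_states).astype(float)
    rho_pi = (counts + smoothing) / (counts.sum() + smoothing * n_states)
    return np.log(p_star[visited_states]) - np.log(rho_pi[visited_states])

# Hypothetical usage with a uniform target distribution over 10 discrete states.
n_states = 10
p_star = np.full(n_states, 1.0 / n_states)
rollout = np.array([0, 0, 1, 2, 2, 2, 3])  # states visited by the current policy
rewards = smm_intrinsic_rewards(rollout, p_star, n_states)
```

Rewards are largest in states the policy under-visits relative to the target, which is what pushes the policy toward broader coverage under this kind of objective.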


Cited by 53 publications (118 citation statements).
References 21 publications.
“…It is important to note, however, that Theorem 2 and Corollary 1 are of independent interest. Recent works [18,48,26] consider the problem of maximizing entropy of occupancy measures, but none of them provide a simple gradient expression that can be used to directly perform gradient ascent in θ on the entropies (4), (5). Our entropy gradient theorem and corresponding corollary resolve this issue, and entropy maximization algorithms using it to maximize (4), (5) are provided in the appendix.…”
Section: Entropy and OIR Policy Gradient Theorems
Citation type: mentioning
confidence: 99%
“…This raises a crucial question: how should the informativeness of policies be quantified? Occupancy measure entropy has recently been used as an optimization objective that quantifies the expected amount of exploration of the state (or state-action) space that a policy performs [18,26,48]. In other words, occupancy measure entropy quantifies the amount of information about the environment that a policy provides, on average, by measuring how uniformly the policy covers the state (or state-action) space.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
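To make the occupancy-measure entropy objective in the statement above concrete, here is a minimal plug-in estimator over a discretized state space; the discretization, smoothing, and example rollouts are illustrative assumptions rather than the estimator used in the cited works.

```python
import numpy as np

def state_occupancy_entropy(visited_states, n_states, smoothing=1e-3):
    """Plug-in estimate of the entropy of a policy's state occupancy measure.

    visited_states: discrete state indices gathered from rollouts of the policy.
    Returns H(rho_pi) = -sum_s rho_pi(s) * log rho_pi(s) from empirical frequencies.
    """
    counts = np.bincount(visited_states, minlength=n_states).astype(float)
    rho_pi = (counts + smoothing) / (counts.sum() + smoothing * n_states)
    return -np.sum(rho_pi * np.log(rho_pi))

# A policy that covers the state space more uniformly scores higher entropy.
print(state_occupancy_entropy(np.array([0, 1, 2, 3, 4, 5]), n_states=6))  # near log(6)
print(state_occupancy_entropy(np.array([0, 0, 0, 0, 0, 5]), n_states=6))  # much lower
```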
“…Distributional Reinforcement Learning and State-Marginal Matching: Modeling the full distribution of returns instead of the averages led to the development of distributional RL algorithms (Bellemare et al., 2017; Castro et al., 2018; Barth-Maron et al., 2018) such as Categorical Q-learning (Bellemare et al., 2017). While our work shares techniques such as discretization and binning, these works focus on optimizing a non-conditional reward-maximizing RL policy, and therefore our problem definition is closer to that of state-marginal matching algorithms (Hazan et al., 2019; Lee et al., 2020; Ghasemipour et al., 2020), or equivalently inverse RL algorithms (Ziebart et al., 2008; Ho & Ermon, 2016; Finn et al., 2016; Fu et al., 2018; Ghasemipour et al., 2020), whose connections to feature-expectation matching have long been discussed (Abbeel & Ng, 2004). However, those are often exclusively online algorithms, even in their sample-efficient variants (Kostrikov et al., 2019), since density-ratio estimation with either a discriminative (Ghasemipour et al., 2020) or a generative (Lee et al., 2020) approach requires on-policy samples, with the rare exception of Kostrikov et al. (2020).…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…While our work shares techniques such as discretization and binning, these works focus on optimizing a non-conditional reward-maximizing RL policy, and therefore our problem definition is closer to that of state-marginal matching algorithms (Hazan et al., 2019; Lee et al., 2020; Ghasemipour et al., 2020), or equivalently inverse RL algorithms (Ziebart et al., 2008; Ho & Ermon, 2016; Finn et al., 2016; Fu et al., 2018; Ghasemipour et al., 2020), whose connections to feature-expectation matching have long been discussed (Abbeel & Ng, 2004). However, those are often exclusively online algorithms, even in their sample-efficient variants (Kostrikov et al., 2019), since density-ratio estimation with either a discriminative (Ghasemipour et al., 2020) or a generative (Lee et al., 2020) approach requires on-policy samples, with the rare exception of Kostrikov et al. (2020). Building on the success of DT and brute-force hindsight imitation learning, our Categorical DT is, to the best of our knowledge, the first method that benchmarks the offline state-marginal matching problem in the multi-task setting.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
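The density-ratio remark in the statement above can be made concrete with the standard discriminative trick: a binary classifier trained to separate target-distribution states from freshly collected on-policy states yields an estimate of log p*(s) − log ρ_π(s) from its log-odds. The sketch below is a minimal illustration of that trick (logistic regression on 1-D states with equal sample sizes), not the discriminator used by the cited methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def log_density_ratio(target_states, policy_states):
    """Discriminative estimate of log p_target(s) - log rho_pi(s).

    Label 1 = samples from the target distribution, label 0 = on-policy samples.
    With equal sample sizes, the classifier's log-odds approximate the log ratio.
    """
    X = np.concatenate([target_states, policy_states])
    y = np.concatenate([np.ones(len(target_states)), np.zeros(len(policy_states))])
    clf = LogisticRegression().fit(X, y)
    return lambda states: clf.decision_function(states)  # log-odds ~ log density ratio

# Hypothetical 1-D example: the target concentrates near 2.0, the policy near 0.0.
rng = np.random.default_rng(0)
target = rng.normal(2.0, 1.0, size=(500, 1))
policy = rng.normal(0.0, 1.0, size=(500, 1))
ratio_fn = log_density_ratio(target, policy)
intrinsic_reward = ratio_fn(policy)  # larger where the policy under-visits the target region
```

Because ρ_π changes whenever the policy changes, the classifier must be refit on fresh on-policy samples, which is the on-policy requirement the quoted passage points out.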