2019
DOI: 10.48550/arxiv.1912.05500
Preprint

What Can Learned Intrinsic Rewards Capture?

Abstract: Reinforcement learning agents can include different components, such as policies, value functions, state representations, and environment models. Any or all of these can be the loci of knowledge, i.e., structures where knowledge, whether given or learned, can be deposited and reused. The objective of an agent is to behave so as to maximise the sum of a suitable scalar function of state: the reward. As far as the learning algorithm is concerned, these rewards are typically given and immutable. In this paper we …
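To make the framing in the abstract concrete, the sketch below shows an agent whose update is driven by a parametric intrinsic reward rather than the fixed, externally given task reward. It is an illustrative sketch only; the names (eta, r_in, agent_step) are hypothetical and do not come from the paper.

```python
import numpy as np

# Illustrative tabular setting (names eta, r_in, q are hypothetical, not from
# the paper): the agent's update is driven by a learned intrinsic reward
# r_in(s, a; eta) instead of the fixed, externally given task reward.

n_states, n_actions = 5, 2
eta = np.zeros((n_states, n_actions))   # learned intrinsic-reward table
q = np.zeros((n_states, n_actions))     # agent's action values

def r_in(s, a):
    """Intrinsic reward: a learned scalar for each (state, action) pair."""
    return eta[s, a]

def agent_step(s, a, s_next, alpha=0.1, gamma=0.99):
    """Standard Q-learning update, but on the intrinsic reward.

    The task reward never enters this update directly; it would only be
    used by an outer process that adapts eta (not shown here).
    """
    td_target = r_in(s, a) + gamma * q[s_next].max()
    q[s, a] += alpha * (td_target - q[s, a])
```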

Cited by 10 publications (6 citation statements)
References 12 publications (15 reference statements)
“…The optimal reward framework [60,63] and shaped rewards [47] (if generated by the agent itself) also consider intrinsic motivation as a way to assist an RL agent in learning the optimal policy for a given task. Such an intrinsically motivated reward signal has previously been learned through various methods such as evolutionary techniques [49,57], meta-gradient approaches [62,72,73], and others. The Wasserstein distance has been used to present a valid reward for imitation learning [70,17] as well as program synthesis [24].…”
Section: Intrinsic Motivation (mentioning; confidence: 99%)
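As a rough illustration of the meta-gradient idea mentioned in the statement above, the sketch below adapts an intrinsic bonus on a two-armed bandit so that the policy trained with that bonus earns more task reward. It is not the algorithm of any cited work: all names are hypothetical, and a finite-difference approximation stands in for the analytic meta-gradient used in those papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 1 pays more task reward in expectation.
def task_reward(arm):
    return rng.normal(loc=(0.0, 1.0)[arm], scale=1.0)

def inner_update(theta, eta, steps=50, lr=0.5):
    """Train softmax policy parameters theta with REINFORCE on
    task reward + intrinsic bonus eta[arm]."""
    theta = theta.copy()
    for _ in range(steps):
        p = np.exp(theta - theta.max()); p /= p.sum()
        arm = rng.choice(2, p=p)
        r = task_reward(arm) + eta[arm]
        grad = -p                      # d log pi(arm) / d theta ...
        grad[arm] += 1.0               # ... = one_hot(arm) - p
        theta += lr * r * grad
    return theta

def extrinsic_return(theta, episodes=200):
    """Average task reward earned by the policy defined by theta."""
    p = np.exp(theta - theta.max()); p /= p.sum()
    arms = rng.choice(2, size=episodes, p=p)
    return np.mean([task_reward(a) for a in arms])

# Outer loop: adapt eta so the *inner-trained* policy earns more task reward.
# A finite-difference approximation stands in for the analytic meta-gradient.
eta, eps = np.zeros(2), 0.1
for _ in range(20):
    meta_grad = np.zeros(2)
    for i in range(2):
        e_plus, e_minus = eta.copy(), eta.copy()
        e_plus[i] += eps
        e_minus[i] -= eps
        j_plus = extrinsic_return(inner_update(np.zeros(2), e_plus))
        j_minus = extrinsic_return(inner_update(np.zeros(2), e_minus))
        meta_grad[i] = (j_plus - j_minus) / (2 * eps)
    eta += 0.5 * meta_grad

print("learned intrinsic bonus per arm:", eta)
```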
“…Unfortunately, the optimal policy under such modified rewards might sometimes be different than the optimal policy under the task reward [47,16]. The problem of learning a reward signal that speeds up learning by communicating what to do but does not interfere by specifying how to do it is thus a useful and complex one [73].…”
Section: Introduction (mentioning; confidence: 99%)
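One well-known way to modify rewards without changing which policy is optimal is potential-based shaping, where the bonus gamma * Phi(s') - Phi(s) is derived from a state potential Phi. The sketch below is a minimal illustration of that construction; the distance-to-goal potential is a hypothetical example, not taken from the cited works.

```python
def shaped_reward(r_task, s, s_next, potential, gamma=0.99, done=False):
    """Potential-based shaping: add gamma * Phi(s') - Phi(s) to the task
    reward. With the potential taken as zero at termination, this form is
    known to leave the optimal policy of the original task unchanged."""
    phi_next = 0.0 if done else potential(s_next)
    return r_task + gamma * phi_next - potential(s)

# Hypothetical usage on a 1-D chain with a distance-to-goal potential.
goal = 10
potential = lambda s: -abs(goal - s)
print(shaped_reward(0.0, s=3, s_next=4, potential=potential))  # positive: moved closer
```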
“…The optimal reward framework [33,35] and shaped rewards [23] (if generated by the agent itself) also consider intrinsic motivation as a way to assist an RL agent in learning the optimal policy for a given task. Such an intrinsically motivated reward signal has previously been learned through various methods such as evolutionary techniques [24,30], meta-gradient approaches [34,39,40], and others. The Wasserstein distance, in particular, has been used to present a valid reward for speeding up learning of goal-conditioned policies [11], imitation learning [37,10,38], as well as program synthesis [15].…”
Section: Related Work (mentioning; confidence: 99%)
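A minimal sketch of a distance-based reward of this kind, assuming 1-D state samples and using scipy.stats.wasserstein_distance as a stand-in for the distance computations in the cited works: the agent is rewarded for making its visited states match a target (e.g. demonstration) distribution. Function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def distance_reward(visited_states, target_states):
    """Negative 1-D Wasserstein distance between the agent's visited states
    and a target (e.g. demonstration or goal) distribution: the closer the
    two empirical distributions, the larger the reward."""
    return -wasserstein_distance(visited_states, target_states)

# Hypothetical usage: demonstration states cluster near 5.0.
demo = np.random.default_rng(0).normal(5.0, 0.5, size=100)
early = np.random.default_rng(1).normal(0.0, 0.5, size=100)  # far from the demo
late = np.random.default_rng(2).normal(4.8, 0.5, size=100)   # close to the demo
print(distance_reward(early, demo) < distance_reward(late, demo))  # True
```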
“…Another core problem of continual RL that meta-learning can potentially help with is exploration. Meta-learning has been repeatedly used in the recent literature to learn functions for intrinsic motivation and improved exploration (Baranes and Oudeyer, 2009; Zheng et al., 2018; Xu et al., 2018a; Yang et al., 2019; Zou et al., 2019; Zheng et al., 2019).…”
Section: Learning To Explore (mentioning; confidence: 99%)