2019
DOI: 10.48550/arxiv.1903.07400
Preprint

Scheduled Intrinsic Drive: A Hierarchical Take on Intrinsically Motivated Exploration

Abstract: Exploration in sparse reward reinforcement learning remains an open challenge. Many state-of-the-art methods use intrinsic motivation to complement the sparse extrinsic reward signal, giving the agent more opportunities to receive feedback during exploration. Commonly these signals are added as bonus rewards, which results in a mixture policy that neither conducts exploration nor task fulfillment resolutely. In this paper, we instead learn separate intrinsic and extrinsic task policies and schedule between the…

Cited by 10 publications (10 citation statements)
References 26 publications
“…Asymptotic Inconsistency. Approaches that define IR as the difference between state representations ψ(s) − ψ(s′) (ψ is a learned embedding network) (Zhang et al., 2019; Marino et al., 2019) suffer from asymptotic inconsistency. In other words, their IR does not vanish even after sufficient exploration: r_i ↛ 0 when N → ∞.…”
Section: Conceptual Advantages of BeBold over Existing Criteria (mentioning)
confidence: 99%
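As a rough illustration of the asymptotic-inconsistency criticism quoted above (the function name and NumPy usage are our own illustrative assumptions, not taken from the cited papers), an embedding-difference bonus has no built-in dependence on visitation counts:

import numpy as np

def embedding_difference_bonus(psi_s, psi_next):
    # Hypothetical sketch of the criticized intrinsic-reward (IR) form:
    # a pure difference of learned state embeddings psi. Nothing here
    # depends on how often a state has been visited, so the bonus need
    # not decay to zero even after extensive exploration.
    return np.linalg.norm(psi_next - psi_s)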
“…For this, Zhang et al. (2019) propose to learn a separate scheduler to switch between intrinsic and extrinsic rewards, and divide the state representation difference by the square root of visitation counts. In comparison, BeBold does not require any extra stage and is a much simpler solution.…”
Section: Conceptual Advantages of BeBold over Existing Criteria (mentioning)
confidence: 99%
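A minimal sketch of the count-normalized form described in the quote above; the argument names and the +1 smoothing term are assumptions on our part:

import numpy as np

def count_normalized_bonus(psi_s, psi_next, visit_count_next):
    # Hypothetical sketch: the embedding-difference bonus divided by the
    # square root of the successor state's visitation count, so the bonus
    # shrinks as the state is revisited. The +1 guards against division
    # by zero for unseen states and is our own addition.
    return np.linalg.norm(psi_next - psi_s) / np.sqrt(visit_count_next + 1)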
“…After training the pure-exploratory policy with intrinsic rewards, there are several ways to combine the intrinsic policy with extrinsic policies trained with extrinsic rewards to enhance performance. Scheduled Intrinsic Drive [34] uses a high-level scheduler that periodically selects whether to follow the extrinsic or the intrinsic policy to gather experience. MuleX [35] learns several policies independently and uses a random heuristic to decide which one to use at each time step.…”
Section: Exploration (mentioning)
confidence: 99%
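A rough sketch of the scheduling idea described in the quote above, assuming a Gym-style environment and two pre-trained low-level policies; all names, the macro-step length, and the scheduler interface are illustrative and not taken from the cited papers:

def scheduled_rollout(env, pi_ext, pi_int, choose, macro_len=50, max_steps=500):
    # Hypothetical sketch: a high-level scheduler ('choose') periodically
    # picks which low-level policy acts for the next macro step, either
    # the extrinsic task policy (pi_ext) or the intrinsic exploration
    # policy (pi_int); the collected transitions can then train both.
    s = env.reset()
    transitions = []
    t = 0
    while t < max_steps:
        policy = pi_ext if choose(s) == "extrinsic" else pi_int
        for _ in range(macro_len):
            a = policy(s)
            s_next, r_ext, done, info = env.step(a)
            transitions.append((s, a, r_ext, s_next, done))
            s = s_next
            t += 1
            if done or t >= max_steps:
                return transitions
    return transitions

A MuleX-style variant, as described in the same quote, would replace the learned 'choose' scheduler with a random heuristic applied at every time step.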
“…While this might work when the two objectives are somewhat aligned, it may be inefficient when they are not [41]. Several works have started to investigate this question and some propose to disentangle exploration and exploitation into distinct phases [7,11,53]. QD presents a natural way of decoupling the optimization of exploitation (quality) and exploration (diversity) by looking for high-performing solutions in local niches of the behavioral space, leading to local competition between solutions instead of a global competition [14,32,35].…”
Section: Introduction (mentioning)
confidence: 99%
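To make the local-competition idea in the quote above concrete, here is a minimal MAP-Elites-style loop; the grid discretization, bin count, and callable names are assumptions on our part and do not reproduce the specific QD algorithms cited:

import random
import numpy as np

def qd_map_elites(evaluate, sample_random, mutate, n_init=100, n_iters=1000, bins=10):
    # Hypothetical sketch of quality-diversity search: each candidate is
    # scored for fitness (quality) and mapped to a behavioral niche
    # (diversity); candidates compete only with the current occupant of
    # their own niche, not with the whole population.
    archive = {}  # niche key -> (fitness, solution)

    def try_insert(x):
        fitness, behavior = evaluate(x)  # behavior descriptor assumed in [0, 1]^d
        niche = tuple(np.clip((np.asarray(behavior) * bins).astype(int), 0, bins - 1))
        if niche not in archive or fitness > archive[niche][0]:
            archive[niche] = (fitness, x)  # local, not global, competition

    for _ in range(n_init):
        try_insert(sample_random())
    for _ in range(n_iters):
        parent = random.choice(list(archive.values()))[1]
        try_insert(mutate(parent))
    return archive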