2019
DOI: 10.48550/arxiv.1903.07400
Preprint

Scheduled Intrinsic Drive: A Hierarchical Take on Intrinsically Motivated Exploration

Abstract: Exploration in sparse reward reinforcement learning remains an open challenge. Many state-of-the-art methods use intrinsic motivation to complement the sparse extrinsic reward signal, giving the agent more opportunities to receive feedback during exploration. Commonly these signals are added as bonus rewards, which results in a mixture policy that neither conducts exploration nor task fulfillment resolutely. In this paper, we instead learn separate intrinsic and extrinsic task policies and schedule between the…

Cited by 10 publications (10 citation statements)
References 26 publications
“…Asymptotic Inconsistency. Approaches that define IR as the difference between state representations ψ(s) − ψ(s′) (ψ is a learned embedding network) (Zhang et al., 2019; Marino et al., 2019) suffer from asymptotic inconsistency. In other words, their IR does not vanish even after sufficient exploration: r_i ↛ 0 when N → ∞.…”
Section: Conceptual Advantages of BeBold over Existing Criteria (mentioning)
confidence: 99%
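As a rough illustration of the asymptotic-inconsistency criticism quoted above (the function name and NumPy usage are our own illustrative assumptions, not taken from the cited papers), an embedding-difference bonus has no built-in dependence on visitation counts:

import numpy as np

def embedding_difference_bonus(psi_s, psi_next):
    # Hypothetical sketch of the criticized intrinsic-reward (IR) form:
    # a pure difference of learned state embeddings psi. Nothing here
    # depends on how often a state has been visited, so the bonus need
    # not decay to zero even after extensive exploration.
    return np.linalg.norm(psi_next - psi_s)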
“…For this, Zhang et al. (2019) propose to learn a separate scheduler to switch between intrinsic and extrinsic rewards, and divide the state representation difference by the square root of visitation counts. In comparison, BeBold does not require any extra stage and is a much simpler solution.…”
Section: Conceptual Advantages of BeBold over Existing Criteria (mentioning)
confidence: 99%
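A minimal sketch of the count-normalized form described in the quote above; the argument names and the +1 smoothing term are assumptions on our part:

import numpy as np

def count_normalized_bonus(psi_s, psi_next, visit_count_next):
    # Hypothetical sketch: the embedding-difference bonus divided by the
    # square root of the successor state's visitation count, so the bonus
    # shrinks as the state is revisited. The +1 guards against division
    # by zero for unseen states and is our own addition.
    return np.linalg.norm(psi_next - psi_s) / np.sqrt(visit_count_next + 1)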
“…After training the pure-exploratory policy with intrinsic rewards, there are several ways to combine the intrinsic policy with extrinsic policies trained with extrinsic rewards to enhance performance. Scheduled Intrinsic Drive [34] uses a high-level scheduler that periodically selects whether to follow the extrinsic or the intrinsic policy to gather experience. MuleX [35] learns several policies independently and uses a random heuristic to decide which one to use at each time step.…”
Section: Exploration (mentioning)
confidence: 99%
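A rough sketch of the scheduling idea described in the quote above, assuming a Gym-style environment and two pre-trained low-level policies; all names, the macro-step length, and the scheduler interface are illustrative and not taken from the cited papers:

def scheduled_rollout(env, pi_ext, pi_int, choose, macro_len=50, max_steps=500):
    # Hypothetical sketch: a high-level scheduler ('choose') periodically
    # picks which low-level policy acts for the next macro step, either
    # the extrinsic task policy (pi_ext) or the intrinsic exploration
    # policy (pi_int); the collected transitions can then train both.
    s = env.reset()
    transitions = []
    t = 0
    while t < max_steps:
        policy = pi_ext if choose(s) == "extrinsic" else pi_int
        for _ in range(macro_len):
            a = policy(s)
            s_next, r_ext, done, info = env.step(a)
            transitions.append((s, a, r_ext, s_next, done))
            s = s_next
            t += 1
            if done or t >= max_steps:
                return transitions
    return transitions

A MuleX-style variant, as described in the same quote, would replace the learned 'choose' scheduler with a random heuristic applied at every time step.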
“…While this might work when the two objectives are somewhat aligned, it may be inefficient when they are not [41]. Several works have started to investigate this question and some propose to disentangle exploration and exploitation into distinct phases [7,11,53]. QD presents a natural way of decoupling the optimization of exploitation (quality) and exploration (diversity) by looking for high-performing solutions in local niches of the behavioral space, leading to local competition between solutions instead of a global competition [14,32,35].…”
Section: Introduction (mentioning)
confidence: 99%
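To make the local-competition idea in the quote above concrete, here is a minimal MAP-Elites-style loop; the grid discretization, bin count, and callable names are assumptions on our part and do not reproduce the specific QD algorithms cited:

import random
import numpy as np

def qd_map_elites(evaluate, sample_random, mutate, n_init=100, n_iters=1000, bins=10):
    # Hypothetical sketch of quality-diversity search: each candidate is
    # scored for fitness (quality) and mapped to a behavioral niche
    # (diversity); candidates compete only with the current occupant of
    # their own niche, not with the whole population.
    archive = {}  # niche key -> (fitness, solution)

    def try_insert(x):
        fitness, behavior = evaluate(x)  # behavior descriptor assumed in [0, 1]^d
        niche = tuple(np.clip((np.asarray(behavior) * bins).astype(int), 0, bins - 1))
        if niche not in archive or fitness > archive[niche][0]:
            archive[niche] = (fitness, x)  # local, not global, competition

    for _ in range(n_init):
        try_insert(sample_random())
    for _ in range(n_iters):
        parent = random.choice(list(archive.values()))[1]
        try_insert(mutate(parent))
    return archive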