2019
DOI: 10.48550/arxiv.1912.05510
Preprint

SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments

Abstract: All living organisms struggle against the forces of nature to carve out a maintainable niche. We propose that such a search for order amidst chaos might offer a unifying principle for the emergence of useful behaviors in artificial agents. We formalize this idea into an unsupervised reinforcement learning method called Surprise Minimizing RL (SMiRL). SMiRL alternates between learning a density model to evaluate the surprise of a stimulus, and improving the policy to seek more predictable stimuli. This process …

Cited by 7 publications (23 citation statements)
References 25 publications (36 reference statements)
“…SMiRL is useful when the environment provides sufficient unexpected and novel events for learning, so that the challenge for the agent is to maintain a steady equilibrium state. [3] SMiRL maintains a distribution p_ΞΈ(s) over which states are likely under its current policy. The agent then modifies its policy Ο€ so that it encounters states s with high p_ΞΈ(s), as well as to seek out states that will change the model p_ΞΈ(s) so that future states are more likely.…”
Section: Surprise Minimizing Reinforcement Learning
mentioning, confidence: 99%
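The density model described in this statement can be made concrete with a minimal sketch. The class below is illustrative and not taken from the SMiRL release: it assumes a diagonal Gaussian p_ΞΈ(s) fit online to the states visited so far in the episode, with log p_ΞΈ(s) used as the surprise-minimizing reward.

import numpy as np

class GaussianSurpriseModel:
    """Illustrative SMiRL-style density model: a diagonal Gaussian p_theta(s)
    fit online to the states visited so far in the episode (sketch only)."""

    def __init__(self, state_dim):
        self.mean = np.zeros(state_dim)
        self.m2 = np.zeros(state_dim)   # running sum of squared deviations
        self.count = 0

    def update(self, state):
        # Welford-style online update of the running mean and variance.
        state = np.asarray(state, dtype=float)
        self.count += 1
        delta = state - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (state - self.mean)

    def variance(self):
        # Clamp to avoid a degenerate Gaussian early in the episode.
        if self.count < 2:
            return np.ones_like(self.mean)
        return np.maximum(self.m2 / self.count, 1e-6)

    def reward(self, state):
        # SMiRL-style reward: log p_theta(s) under the current Gaussian,
        # so the policy is rewarded for visiting familiar, predictable states.
        var = self.variance()
        diff = np.asarray(state, dtype=float) - self.mean
        return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var))

In the full method the policy Ο€ would then be trained with any standard RL algorithm on this reward, with the model updated after every environment step, matching the alternation described in the abstract.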
“…To account for this we use an augmented MDP that captures this notion. [3] We note that in our implementation of SMiRL, p_ΞΈ_t(s) is normally distributed. To construct the augmented MDP we include sufficient statistics for p_ΞΈ_t(s) in the state space, such as the parameters of our normal distribution and the number of states seen so far.…”
Section: SMiRL Reward Formulation
mentioning, confidence: 99%
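A corresponding sketch of the augmented observation described in this statement, assuming the GaussianSurpriseModel sketch above: the raw observation is concatenated with the sufficient statistics of p_ΞΈ_t(s) (its mean and variance) and the number of states seen so far. The function name and layout are illustrative, not the paper's implementation.

import numpy as np

def augment_observation(obs, model):
    """Sketch of the augmented-MDP observation: the raw observation plus the
    sufficient statistics of p_theta_t(s) (mean, variance) and the state
    count, so the surprise reward is well defined in the augmented state."""
    stats = np.concatenate([model.mean, model.variance(), [float(model.count)]])
    return np.concatenate([np.asarray(obs, dtype=float), stats])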
“…Therefore, we allow for lower variability on a perceptual variable (speed indicator) but higher variability on our actions (press or release the brake) to achieve our control objective. Other related frameworks, such as planning-as-inference, active inference, surprise-minimising RL and KL control, are also based on the idea that goal-directed behavior amounts to reducing the entropy or variance of the final (goal) state(s) of the controlled dynamical system [1,4,13,2,31,18]. For example, when balancing an inverted pendulum, the task ends in one single state, the one with the pendulum up and still.…”
Section: Introduction
mentioning, confidence: 99%
“…These many desires shape the way organisms interact with their environment, encouraging them to discover new things but also to protect themselves, avoiding overly surprising events with mechanisms like fear [19]. Berseth et al. [5] exemplified how to exploit such priors by implementing a "homeostasis" objective for RL, thereby showing how different from "novelty seeking" these priors can be. Eventually, resource constraints stop organisms from exhaustively exploring their environment and push them to transfer knowledge from past experience.…”
Section: Introduction
mentioning, confidence: 99%