2016
DOI: 10.48550/arxiv.1611.01211
Preprint

Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear

Zachary C. Lipton,
Kamyar Azizzadenesheli,
Abhishek Kumar
et al.

Abstract: Many practical environments contain catastrophic states that an optimal agent would visit infrequently or never. Even on toy problems, Deep Reinforcement Learning (DRL) agents tend to periodically revisit these states upon forgetting their existence under a new policy. We introduce intrinsic fear (IF), a learned reward shaping that guards DRL agents against periodic catastrophes. IF agents possess a fear model trained to predict the probability of imminent catastrophe. This score is then used to penalize the Q…
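
The abstract describes the mechanism only at a high level: a fear model estimates the probability that the next state is close to a catastrophe, and that probability, scaled by a fear coefficient, is subtracted from the usual Q-learning bootstrap target. The sketch below illustrates one plausible form of that shaped target in PyTorch; the names (q_net, fear_model, lambda_fear), network sizes, and tensor shapes are illustrative assumptions, not the paper's released code.

import torch
import torch.nn as nn

state_dim, n_actions = 4, 2

# Q-network and fear model (illustrative architectures). The fear model
# outputs the predicted probability that the successor state is near a
# catastrophic state.
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
fear_model = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

gamma, lambda_fear = 0.99, 1.0  # discount factor and fear penalty coefficient

def intrinsic_fear_target(reward, next_state, done):
    # Standard DQN bootstrap target, minus a penalty proportional to the
    # fear model's predicted probability of imminent catastrophe.
    with torch.no_grad():
        bootstrap = q_net(next_state).max(dim=-1).values
        fear = fear_model(next_state).squeeze(-1)
    return reward + gamma * (1.0 - done) * bootstrap - lambda_fear * fear

# Example transition: reward 0.1, random next state, episode not done.
r = torch.tensor([0.1])
s_next = torch.randn(1, state_dim)
d = torch.tensor([0.0])
print(intrinsic_fear_target(r, s_next, d))

In the paper, the fear model itself is trained as a binary classifier that separates states observed shortly before a catastrophe from safe states; that training loop is omitted here and only the shaped target is shown.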

Cited by 14 publications (18 citation statements) · References 11 publications

Citation statements:

“…Exploration-exploitation trade-offs have been studied theoretically in RL but remain a prominent problem in high-dimensional environments [45,5,12]. Recent successes of deep RL on Atari games [45], the board game Go [60], robotics [40], self-driving cars [59], and safety in RL [42] hold promise for deploying deep RL in high-dimensional problems.…”
Section: Related Work
confidence: 99%
“…This includes reversibility methods [Eysenbach et al., 2017], which avoid side effects in tasks that don't require irreversible actions. Safe exploration methods that penalize risk [Chow et al., 2015] or use intrinsic motivation [Lipton et al., 2016] help the agent avoid side effects that result in lower reward (such as getting trapped or damaged), but do not discourage the agent from damaging the environment in ways that are not penalized by the reward function (e.g., breaking vases).…”
Section: Results
confidence: 99%
“…These sim-to-real approaches are complementary to our LaND approach; the simulation policy can be used to initialize the real-world policy, while our method continues to fine-tune by learning from disengagements. Other reinforcement learning methods, including ours, learn directly from the robot's experiences [30], [31], [32], [33], [34], [35], [36], [37]. However, these methods typically assume catastrophic failures are acceptable [32], [37] or that a safe controller is available [33], [34], gather data in a single area over multiple traversals [30], [32], [36], require on-policy data collection [36], require a reward signal beyond disengagement [30], perform their evaluations in the training environment [30], [36], or are only demonstrated in simulation [31], [35].…”
Section: Related Work
confidence: 99%
“…Other reinforcement learning methods, including ours, learn directly from the robot's experiences [30], [31], [32], [33], [34], [35], [36], [37]. However, these methods typically assume catastrophic failures are acceptable [32], [37] or that a safe controller is available [33], [34], gather data in a single area over multiple traversals [30], [32], [36], require on-policy data collection [36], require a reward signal beyond disengagement [30], perform their evaluations in the training environment [30], [36], or are only demonstrated in simulation [31], [35]. In contrast, our LaND method is safe because it leverages the existing human safety driver, learns from off-policy data, does not require re-traversing an area multiple times, learns directly from whether the robot is engaged or disengaged, and evaluates in novel, never-before-seen real-world environments.…”
Section: Related Work
confidence: 99%