2016
DOI: 10.48550/arxiv.1611.01211
Preprint

Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear

Zachary C. Lipton,
Kamyar Azizzadenesheli,
Abhishek Kumar
et al.

Abstract: Many practical environments contain catastrophic states that an optimal agent would visit infrequently or never. Even on toy problems, Deep Reinforcement Learning (DRL) agents tend to periodically revisit these states upon forgetting their existence under a new policy. We introduce intrinsic fear (IF), a learned reward shaping that guards DRL agents against periodic catastrophes. IF agents possess a fear model trained to predict the probability of imminent catastrophe. This score is then used to penalize the Q…
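
The abstract describes the mechanism only at a high level: a fear model estimates the probability that the next state is close to a catastrophe, and that probability, scaled by a fear coefficient, is subtracted from the usual Q-learning bootstrap target. The sketch below illustrates one plausible form of that shaped target in PyTorch; the names (q_net, fear_model, lambda_fear), network sizes, and tensor shapes are illustrative assumptions, not the paper's released code.

import torch
import torch.nn as nn

state_dim, n_actions = 4, 2

# Q-network and fear model (illustrative architectures). The fear model
# outputs the predicted probability that the successor state is near a
# catastrophic state.
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
fear_model = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

gamma, lambda_fear = 0.99, 1.0  # discount factor and fear penalty coefficient

def intrinsic_fear_target(reward, next_state, done):
    # Standard DQN bootstrap target, minus a penalty proportional to the
    # fear model's predicted probability of imminent catastrophe.
    with torch.no_grad():
        bootstrap = q_net(next_state).max(dim=-1).values
        fear = fear_model(next_state).squeeze(-1)
    return reward + gamma * (1.0 - done) * bootstrap - lambda_fear * fear

# Example transition: reward 0.1, random next state, episode not done.
r = torch.tensor([0.1])
s_next = torch.randn(1, state_dim)
d = torch.tensor([0.0])
print(intrinsic_fear_target(r, s_next, d))

In the paper, the fear model itself is trained as a binary classifier that separates states observed shortly before a catastrophe from safe states; that training loop is omitted here and only the shaped target is shown.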

Cited by 14 publications (18 citation statements) · References 11 publications

Citation statements:

“…Exploration-exploitation trade-offs have been studied theoretically in RL but remain a prominent problem in high-dimensional environments [45,5,12]. Recent successes of deep RL on Atari games [45], the board game Go [60], robotics [40], self-driving cars [59], and safety in RL [42] hold promise for deploying deep RL in high-dimensional problems.…”
Section: Related Work
confidence: 99%
“…This includes reversibility methods [Eysenbach et al., 2017], which avoid side effects in tasks that don't require irreversible actions. Safe exploration methods that penalize risk [Chow et al., 2015] or use intrinsic motivation [Lipton et al., 2016] help the agent avoid side effects that result in lower reward (such as getting trapped or damaged), but do not discourage the agent from damaging the environment in ways that are not penalized by the reward function (e.g., breaking vases).…”
Section: Results
confidence: 99%
“…These sim-to-real approaches are complementary to our LaND approach; the simulation policy can be used to initialize the real-world policy, while our method continues to fine-tune by learning from disengagements. Other reinforcement learning methods, including ours, learn directly from the robot's experiences [30], [31], [32], [33], [34], [35], [36], [37]. However, these methods typically assume catastrophic failures are acceptable [32], [37] or that a safe controller is available [33], [34], gather data in a single area over multiple traversals [30], [32], [36], require on-policy data collection [36], require a reward signal beyond disengagement [30], perform their evaluations in the training environment [30], [36], or are only demonstrated in simulation [31], [35].…”
Section: Related Work
confidence: 99%
“…Other reinforcement learning methods, including ours, learn directly from the robot's experiences [30], [31], [32], [33], [34], [35], [36], [37]. However, these methods typically assume catastrophic failures are acceptable [32], [37] or that a safe controller is available [33], [34], gather data in a single area over multiple traversals [30], [32], [36], require on-policy data collection [36], require a reward signal beyond disengagement [30], perform their evaluations in the training environment [30], [36], or are only demonstrated in simulation [31], [35]. In contrast, our LaND method is safe because it leverages the existing human safety driver, learns from off-policy data, does not require re-traversing an area multiple times, learns directly from whether the robot is engaged or disengaged, and evaluates in novel, never-before-seen real-world environments.…”
Section: Related Work
confidence: 99%