2017
DOI: 10.48550/arxiv.1712.00378
Preprint

Time Limits in Reinforcement Learning

Abstract: In reinforcement learning, it is common to let an agent interact for a fixed amount of time with its environment before resetting it and repeating the process in a series of episodes. The task that the agent has to learn can either be to maximize its performance over (i) that fixed period, or (ii) an indefinite period where time limits are only used during training to diversify experience. In this paper, we provide a formal account for how time limits could effectively be handled in each of the two cases and e…
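The two cases in the abstract differ mainly in how a timeout should be treated when forming bootstrapped value targets. The sketch below is a minimal illustration of that distinction using a one-step TD target; it is not the authors' implementation, and all names are assumptions.

```python
# Minimal sketch (assumed names, not the authors' code): how a one-step TD
# target can treat the two cases from the abstract differently.
def td_target(reward, next_value, terminated, timed_out, gamma=0.99,
              task_is_time_limited=False):
    """Compute a one-step TD target r + gamma * V(s') with termination handling.

    Case (i):  the task itself is to maximize return over the fixed period,
               so a timeout is a genuine terminal event (no bootstrapping).
    Case (ii): the task is indefinite and the time limit only exists to
               diversify training experience, so the agent should still
               bootstrap from the value of the state reached at the timeout.
    """
    if terminated:                      # environmental termination (e.g. goal reached)
        return reward
    if timed_out and task_is_time_limited:
        return reward                   # case (i): timeout is a real terminal state
    return reward + gamma * next_value  # case (ii) or an ordinary mid-episode step
```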

Cited by 11 publications (16 citation statements)
References 22 publications

“…Negative reward is given to an invalid action which either attempts to insert when m_k = 0 or leads to a trapped state. The formulation is analogous to an agent navigating a grid world with random obstacles and a limited number of steps, where the grids are replaced with different realizations of the connectome [35]. The reward function is intended to guide the DQN agent to modify the connectome at appropriate steps, leading to maximum reward at the end step while avoiding inserting either too many or too few connections.…”
Section: Connectome
confidence: 99%
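For concreteness, a reward of this shape (invalid actions penalized, the end-of-episode outcome rewarded, under a fixed step budget) could look like the sketch below. This is a hypothetical illustration of the formulation described in the quote, not code from the cited work; the penalty value, step budget, and helper names are all assumptions.

```python
# Hypothetical sketch of the reward scheme described above: invalid actions
# (inserting when m_k == 0, or reaching a trapped state) get a negative reward,
# and only the final step scores the resulting connectome.
INVALID_PENALTY = -1.0
MAX_STEPS = 20  # fixed step budget, as in a grid world with limited steps

def step_reward(step, is_valid, is_trapped, final_score=None):
    if not is_valid or is_trapped:
        return INVALID_PENALTY          # discourage invalid insertions
    if step == MAX_STEPS - 1 and final_score is not None:
        return final_score              # reward the final connectome at the end step
    return 0.0                          # intermediate steps carry no reward
```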
“…Our work is most closely related to Pardo et al. (2017) and Zintgraf et al. (2019). Pardo et al. (2017) study the impact of fixed time limits and time-awareness on deep reinforcement learning agents. They propose using a timestamp as part of the state representation in order to avoid state-aliasing and the non-Markovianity resulting from a finite-horizon treatment of an infinite-horizon problem.…”
Section: Related Work and Background
confidence: 99%
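One common way to realize the time-awareness described above is to append the normalized remaining time to the observation. The wrapper below is a minimal sketch of that idea, assuming a Gymnasium environment with a Box observation space; the class name and the particular normalization are assumptions, not the paper's code.

```python
# Sketch (assumed names): append remaining time to the observation so the
# agent can distinguish otherwise-aliased states near the time limit.
import numpy as np
import gymnasium as gym


class TimeAwareObs(gym.ObservationWrapper):
    def __init__(self, env, time_limit):
        super().__init__(env)
        self.time_limit = time_limit
        self._t = 0
        # Extend the Box observation space with one extra feature in [0, 1].
        low = np.append(env.observation_space.low, 0.0)
        high = np.append(env.observation_space.high, 1.0)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._t = 0
        obs, info = self.env.reset(**kwargs)
        return self.observation(obs), info

    def step(self, action):
        self._t += 1
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self.observation(obs), reward, terminated, truncated, info

    def observation(self, obs):
        # Remaining time as a fraction in [0, 1], appended as an extra feature.
        remaining = 1.0 - self._t / self.time_limit
        return np.append(obs, remaining).astype(np.float32)
```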
“…The episode terminates once the agent is within 1 meter of the goal. We also terminate if the agent has failed to reach the goal after 20 time steps, but treat the two types of termination differently when computing the TD error (see Pardo et al. (2017)). Note that it is challenging to specify a meaningful distance metric and local policy on pixel inputs, so it is difficult to apply standard planning algorithms to this task.…”
Section: Didactic Example: 2D Navigation
confidence: 99%
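"Treat the two types of termination differently" here maps onto which transitions are bootstrapped in the TD target. The sketch below shows one way this is commonly done in a DQN-style update, assuming PyTorch and a replay batch that records whether an episode ended by reaching the goal ('terminated') or by hitting the 20-step limit ('truncated'); all names are illustrative, not taken from the cited code.

```python
# Sketch (assumed names): bootstrapping past timeouts in a DQN-style TD error.
import torch

def dqn_td_error(q_net, target_net, batch, gamma=0.99):
    """batch holds tensors: 'obs', 'action', 'reward', 'next_obs',
    'terminated' (goal reached) and 'truncated' (20-step limit hit)."""
    q = q_net(batch["obs"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(batch["next_obs"]).max(dim=1).values
    # Only a true environmental termination stops bootstrapping; hitting the
    # time limit ('truncated') still bootstraps from the next state's value,
    # in the spirit of Pardo et al. (2017).
    bootstrap = 1.0 - batch["terminated"].float()
    target = batch["reward"] + gamma * bootstrap * next_q
    return q - target
```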