Recurrent policy gradients

2009 · DOI: 10.1093/jigpal/jzp049

Abstract: Reinforcement learning for partially observable Markov decision problems (POMDPs) is a challenge as it requires policies with an internal state. Traditional approaches suffer significantly from this shortcoming and usually make strong assumptions on the problem domain such as perfect system models, state-estimators and a Markovian hidden system. Recurrent neural networks (RNNs) offer a natural framework for dealing with policy learning using hidden state and require only a few limiting assumptions. As they can b…
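The abstract is cut off above, but the core idea it describes — an RNN whose hidden state acts as the policy's internal memory, trained by following the gradient of expected return — can be illustrated with a minimal sketch. The code below is a toy illustration, not the paper's exact algorithm: the MemoryPOMDP task, the GRU policy, and the plain episodic REINFORCE update are all assumptions made for the example.

```python
import random
import torch
import torch.nn as nn


class MemoryPOMDP:
    """Toy POMDP: the rewarded final action is the cue shown only at t = 0."""
    def __init__(self, length=5):
        self.length = length

    def reset(self):
        self.t, self.cue = 0, random.randint(0, 1)
        return [1.0, float(self.cue)]          # cue visible only in the first observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        reward = 1.0 if (done and action == self.cue) else 0.0
        return [0.0, 0.0], reward, done        # later observations hide the cue


class RecurrentPolicy(nn.Module):
    """GRU policy: the hidden state is the agent's internal memory of past observations."""
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.rnn = nn.GRUCell(obs_dim, hidden)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, h):
        h = self.rnn(obs, h)                   # update internal memory
        return torch.distributions.Categorical(logits=self.head(h)), h


def rollout(env, policy):
    """Sample one episode, keeping per-step log-probabilities for the gradient estimate."""
    obs, h, done = env.reset(), torch.zeros(1, policy.hidden), False
    logps, rewards = [], []
    while not done:
        dist, h = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0), h)
        action = dist.sample()
        logps.append(dist.log_prob(action))
        obs, reward, done = env.step(action.item())
        rewards.append(reward)
    return torch.cat(logps), rewards


if __name__ == "__main__":
    env, policy = MemoryPOMDP(), RecurrentPolicy(obs_dim=2, n_actions=2)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
    for episode in range(2000):
        logps, rewards = rollout(env, policy)
        loss = -(logps * sum(rewards)).sum()   # REINFORCE; gradient flows back through time
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Because the return-weighted log-probabilities depend on the hidden state, backpropagation here runs through time, so the gradient can reward the network for remembering the cue across the blank observations.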

Cited by 79 publications (78 citation statements) · References 21 publications · Citing publications: 2011–2023

Citation statements (ordered by relevance):
“…6.3) into sequences of simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent HRL organizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor control maps (Ring et al., 2011) inspired by neurophysiological findings (Graziano, 2009). […] (Williams, 1986, 1988, 1992a; Sutton et al., 1999a; Baxter and Bartlett, 2001; Aberdeen, 2003; Ghavamzadeh and Mahadevan, 2003; Kohl and Stone, 2004; Wierstra et al., 2008; Rückstieß et al., 2008; Peters and Schaal, 2008b,a; Sehnke et al., 2010; Grüttner et al., 2010; Wierstra et al., 2010; Peters, 2010; Grondman et al., 2012; Heess et al., 2012). Gradients of the total reward with respect to policies (NN weights) are estimated (and then exploited) through repeated NN evaluations.…”
Section: Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs (mentioning)
confidence: 99%
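The last sentence of that statement — estimating the gradient of total reward with respect to the network weights through repeated evaluations of the policy — can be written as a Monte-Carlo score-function estimator. The helper below is a hypothetical illustration that reuses rollout() and the toy setup from the sketch above; the batch size of 16 is an arbitrary assumption.

```python
def estimated_policy_gradient(env, policy, n_rollouts=16):
    """Score-function estimate of the gradient of expected total reward
    w.r.t. the policy weights, averaged over repeated evaluations."""
    policy.zero_grad()
    for _ in range(n_rollouts):                              # repeated NN evaluations
        logps, rewards = rollout(env, policy)                # rollout() from the sketch above
        ((logps * sum(rewards)).sum() / n_rollouts).backward()
    # Ascent direction; an optimizer minimising a loss would use the negation.
    return [p.grad.clone() for p in policy.parameters()]
```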
“…In this work, we build on advances in policy gradient reinforcement learning, specifically the REINFORCE algorithm (Williams, 1992; Sutton et al., 2000; Peters & Schaal, 2008; Wierstra et al., 2009), to demonstrate reward-based training of recurrent neural networks (RNNs) for several well-known experimental paradigms in systems neuroscience. The networks consist of two modules in an "actor-critic" architecture (Barto et al., 1983; Grondman et al., 2012), in which a policy network uses inputs provided by the environment to select actions that maximize reward, while a value network uses the selected actions and activity of the policy network to predict future reward and guide learning.…”
Section: Introduction (mentioning)
confidence: 99%
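A rough sketch of the actor-critic arrangement that statement describes: a recurrent policy network selects actions, and a separate value network reads the policy's hidden activity together with the selected action to predict future reward, which then serves as a baseline for the policy update. The architecture sizes, losses, and toy task below are illustrative assumptions (reusing RecurrentPolicy and MemoryPOMDP from the first sketch), not the cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueNet(nn.Module):
    """Critic: predicts the return from the policy's hidden activity and the chosen action."""
    def __init__(self, hidden=32, n_actions=2):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(nn.Linear(hidden + n_actions, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, h, action):
        a = F.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([h, a], dim=-1)).squeeze(-1)


env, policy, critic = MemoryPOMDP(), RecurrentPolicy(obs_dim=2, n_actions=2), ValueNet()
opt = torch.optim.Adam(list(policy.parameters()) + list(critic.parameters()), lr=1e-3)

for episode in range(2000):
    obs, h, done = env.reset(), torch.zeros(1, policy.hidden), False
    logps, values, rewards = [], [], []
    while not done:
        dist, h = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0), h)
        action = dist.sample()
        logps.append(dist.log_prob(action))
        values.append(critic(h.detach(), action))   # critic reads policy activity + action
        obs, reward, done = env.step(action.item())
        rewards.append(reward)
    returns = torch.tensor([sum(rewards[t:]) for t in range(len(rewards))])  # reward-to-go
    logps, values = torch.cat(logps), torch.cat(values)
    advantage = returns - values.detach()           # critic as baseline reduces variance
    loss = -(logps * advantage).sum() + F.mse_loss(values, returns)
    opt.zero_grad()
    loss.backward()
    opt.step()
```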
“…Indeed, as in Dayan & Daw (2008) one of the goals of this work is to unify related computations into a common language that is applicable to a wide range of tasks in systems neuroscience. However, the formulation using policies represented by RNNs allows for a far more general description, and, in particular, makes the assumption of a Markovian environment unnecessary (Wierstra et al, 2009). Such policies can also be compared more directly to "optimal" solutions when they are known, for instance to the signal detection theory account of perceptual decision-making (Gold & Shadlen, 2007).…”
Section: Introduction (mentioning)
confidence: 99%
“…These techniques learn to map observations directly to actions and they use their internal memory to summarise important information from the past observations. For example, Wierstra et al. [2010] used recurrent neural networks (RNNs) to approximate the policy. At each step the RNN updates its internal memory and proposes a new system action based on the accumulated information in the internal memory and the last observation.…”
Section: Discussion (mentioning)
confidence: 99%
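The step loop that statement describes reduces to a few lines: the agent keeps no explicit observation history, only its recurrent hidden state, and each new action is proposed from that memory plus the last observation. A minimal sketch, reusing the policy and toy environment defined in the first sketch; a real dialogue system would substitute its own observation/action interface.

```python
h = torch.zeros(1, policy.hidden)              # empty memory at the start of the interaction
obs, done = env.reset(), False
while not done:
    dist, h = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0), h)
    action = dist.sample()                     # system action from memory + last observation only
    obs, reward, done = env.step(action.item())
```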