Proceedings of the 26th Annual International Conference on Machine Learning 2009
DOI: 10.1145/1553374.1553501

Fast gradient-descent methods for temporal-difference learning with linear function approximation

Abstract: Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introd…
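
To make the abstract concrete, here is a minimal sketch of the two-timescale, linear-complexity update style the paper is about, written in the form of the GTD2 update rule introduced in the full paper: a second auxiliary weight vector w is learned alongside the value weights theta. The function name, the synthetic data, and the step sizes below are illustrative assumptions, not the authors' code, and the importance-sampling weighting used for off-policy training is omitted for brevity.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2-style update for linear TD policy evaluation (sketch).

    theta: value-function weights; w: auxiliary weights that track the
    expected TD error given the features. Per-step cost is O(d), matching
    the linear scaling emphasized in the abstract.
    """
    delta = reward + gamma * phi_next.dot(theta) - phi.dot(theta)   # TD error
    theta = theta + alpha * (phi - gamma * phi_next) * phi.dot(w)   # value weights
    w = w + beta * (delta - phi.dot(w)) * phi                       # auxiliary weights
    return theta, w

# Illustrative usage on purely synthetic transitions.
d = 8
rng = np.random.default_rng(0)
theta, w = np.zeros(d), np.zeros(d)
for _ in range(1000):
    phi, phi_next = rng.random(d), rng.random(d)
    theta, w = gtd2_step(theta, w, phi, phi_next, reward=rng.normal(),
                         gamma=0.9, alpha=0.05, beta=0.1)
```

The role of the second weight vector is that the main update follows an estimated gradient of a projected Bellman-error objective rather than the semi-gradient used by conventional TD, which is what gives convergence guarantees under off-policy sampling.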

Cited by 578 publications (948 citation statements); references 17 publications. Selected citation statements follow, ordered by relevance.
“…Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamaría et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010; van Hasselt, 2012). Most are formulated in a probabilistic framework, and evaluate pairs of input and output (action) events (instead of input events only).…”
Section: Deep FNNs for Traditional RL and Markov Decision Processes (mentioning)
confidence: 99%
“…6.3) into sequences of simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent HRL organizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor control maps (Ring et al., 2011) inspired by neurophysiological findings (Graziano, 2009). […] (Williams, 1986, 1988, 1992a; Sutton et al., 1999a; Baxter and Bartlett, 2001; Aberdeen, 2003; Ghavamzadeh and Mahadevan, 2003; Kohl and Stone, 2004; Wierstra et al., 2008; Rückstieß et al., 2008; Peters and Schaal, 2008b,a; Sehnke et al., 2010; Grüttner et al., 2010; Wierstra et al., 2010; Peters, 2010; Grondman et al., 2012; Heess et al., 2012). Gradients of the total reward with respect to policies (NN weights) are estimated (and then exploited) through repeated NN evaluations.…”
Section: Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs (mentioning)
confidence: 99%
“…which is modeled as a probability distribution in order to incorporate exploratory actions; for some special problems, the optimal solution to a control problem is actually a stochastic controller, see e.g., Sutton, McAllester, Singh, and Mansour (2000).…”
Section: General Assumptions and Problem Statement (mentioning)
confidence: 99%
“…This allows two variations of the previous algorithm which are known as the policy gradient theorem (Sutton et al., 2000) …”
Section: Policy Gradient Theorem and G(PO)MDP (mentioning)
confidence: 99%
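
For reference, the policy gradient theorem cited in the statement above can be written as follows. This is the standard statement for a stochastic policy; the notation ($d^{\pi_\theta}$ for the on-policy discounted state distribution, $Q^{\pi_\theta}$ for the action-value function) is a common convention assumed here rather than taken from the citing paper.

```latex
\nabla_\theta J(\theta)
  = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}
      \big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \big]
```

The key point is that the right-hand side involves no gradient of the state distribution, so estimators such as REINFORCE and G(PO)MDP can approximate it from sampled trajectories alone.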
“…The RL methods developed so far can be categorized into two types: Policy iteration where policies are learned based on value function approximation [21,12] and policy search where policies are learned directly to maximize expected future rewards [24,4,22,8,15,26]. …”
Section: Introduction (mentioning)
confidence: 99%
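
To illustrate the two categories in that last statement, here is a small, hypothetical tabular sketch (not taken from the citing paper) contrasting a value-function-based update, where the policy is implicit in the learned values, with a direct policy-search update that adjusts policy parameters along a return-weighted score function.

```python
import numpy as np

n_states, n_actions = 5, 2

# (a) Policy iteration / value-function approximation: learn Q, act greedily in Q.
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Q-learning-style bootstrap toward the greedy one-step target.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# (b) Direct policy search: parameterize a softmax policy and follow a
# REINFORCE-style gradient of the expected return.
theta = np.zeros((n_states, n_actions))

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def reinforce_update(s, a, ret, alpha=0.01):
    grad_log = -policy(s)
    grad_log[a] += 1.0                  # gradient of log softmax w.r.t. theta[s]
    theta[s] += alpha * ret * grad_log  # return-weighted score-function step

# Example single updates on dummy transitions.
q_update(0, 1, 1.0, 2)
reinforce_update(0, 1, 1.0)
```

The first style improves a value estimate and derives the policy from it; the second climbs the expected return directly without an explicit value function, which is the distinction the quoted introduction draws.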