2012
DOI: 10.1109/tsmcc.2012.2218595
Abstract: Policy gradient based actor-critic algorithms are amongst the most popular algorithms in the reinforcement learning framework. Their advantage of being able to search for optimal policies using low-variance gradient estimates has made them useful in several real-life applications, such as robotics, power control and finance. Although general surveys on reinforcement learning techniques already exist, no survey is specifically dedicated to actor-critic algorithms in particular. This paper therefore describes th…
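For orientation, the low-variance gradient estimates mentioned in the abstract are usually obtained from the policy gradient theorem, with a learned critic standing in for the true action-value function. In standard notation (assumed here, not quoted from the paper):

```latex
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}
  \bigl[\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \,\bigr]
```

The actor adjusts the policy parameters θ along an estimate of this gradient, while the critic supplies the estimate of Q^{π_θ} (or of an advantage function) that keeps the variance of that estimate low.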

Citations: cited by 733 publications (354 citation statements)
References: 64 publications
“…6.3) into sequences of simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent HRL organizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor control maps (Ring et al., 2011) inspired by neurophysiological findings (Graziano, 2009) … (Williams, 1986, 1988, 1992a; Sutton et al., 1999a; Baxter and Bartlett, 2001; Aberdeen, 2003; Ghavamzadeh and Mahadevan, 2003; Kohl and Stone, 2004; Wierstra et al., 2008; Rückstieß et al., 2008; Peters and Schaal, 2008a,b; Sehnke et al., 2010; Grüttner et al., 2010; Wierstra et al., 2010; Peters, 2010; Grondman et al., 2012; Heess et al., 2012). Gradients of the total reward with respect to policies (NN weights) are estimated (and then exploited) through repeated NN evaluations.…”
Section: Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs (mentioning)
confidence: 99%
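A minimal sketch of the idea in that last sentence: estimate the gradient of the total reward with respect to the policy weights purely from repeated policy evaluations, here via the likelihood-ratio (REINFORCE-style) estimator on a toy one-step problem. Everything below (the Gaussian policy, the toy reward, the step sizes) is an illustrative assumption, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(theta, state):
    # Gaussian policy with mean theta . state and unit variance; theta plays
    # the role of the policy (NN) weights in the excerpt above.
    return rng.normal(np.dot(theta, state), 1.0)

def episode_return(state, action):
    # Toy one-step "total reward": peaks when action == 2 * sum(state).
    return -(action - 2.0 * state.sum()) ** 2

def estimate_gradient(theta, n_rollouts=500):
    # Estimate d(total reward)/d(theta) from repeated policy evaluations,
    # using grad J ~ mean over rollouts of grad log pi(a|s) * return.
    grad = np.zeros_like(theta)
    for _ in range(n_rollouts):
        state = rng.normal(size=theta.shape)
        action = sample_action(theta, state)
        grad_log_pi = (action - np.dot(theta, state)) * state  # Gaussian policy
        grad += grad_log_pi * episode_return(state, action)
    return grad / n_rollouts

theta = np.zeros(3)
for _ in range(300):
    theta += 0.02 * estimate_gradient(theta)  # exploit the estimate
print(theta)  # drifts toward roughly [2, 2, 2] under this toy reward
```

Actor-critic methods, the subject of the surveyed paper, reduce the variance of exactly this kind of estimate by replacing the raw return with a learned critic.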
“…The most common methods utilised in the application of RL to continuous action-space problems are either actor-critic methods (AC) [3] or direct policy search (DPS) [4]. AC requires the approximation of two functions: the value function Q : S × A → R, giving the expected long term reward from being in a given state s ∈ S and taking an action a ∈ A, and the policy function π : S → A, which is a mapping from states to actions.…”
Section: Introduction (mentioning)
confidence: 99%
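As a rough illustration of the two approximators that excerpt names, a critic Q : S × A → R and a deterministic policy π : S → A, here is a minimal sketch using linear function approximation. The linear features and the class name are assumptions made for brevity; the cited works use a variety of function approximators.

```python
import numpy as np

class LinearActorCritic:
    # Two separate approximators, mirroring the excerpt: a critic
    # Q : S x A -> R and a deterministic policy pi : S -> A.
    def __init__(self, state_dim, action_dim):
        self.w = np.zeros(state_dim + action_dim)       # critic weights
        self.theta = np.zeros((action_dim, state_dim))  # actor weights

    def q_value(self, state, action):
        # Critic: expected long-term reward of taking `action` in `state`,
        # approximated here as linear in the state-action features.
        return float(self.w @ np.concatenate([state, action]))

    def policy(self, state):
        # Actor: direct mapping from states to continuous actions.
        return self.theta @ state

ac = LinearActorCritic(state_dim=4, action_dim=2)
a = ac.policy(np.ones(4))
q = ac.q_value(np.ones(4), a)
```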
“…Instead, the action to take from a given state is calculated as required from the Q function. Despite being the method of choice when the action-space is small or discrete [3], IPM is not frequently applied when the action-space is large or continuous. This is because it becomes impossible to compare the values of every possible action, and it has been stated that applying optimisation to the action selection at every time-step would be prohibitively time consuming [3], [5].…”
Section: Introduction (mentioning)
confidence: 99%
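A quick sketch of the contrast being drawn there: with a small discrete action set, greedy selection from Q is one cheap comparison pass, whereas a continuous action space forces a numerical optimisation over actions at every step. The quadratic stand-in for Q and the use of scipy.optimize.minimize are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def q_value(state, action):
    # Stand-in for a learned Q-function; any critic would do here.
    return -float(np.sum((np.atleast_1d(action) - state.mean()) ** 2))

state = np.array([0.2, 0.8, 0.5])

# Small discrete action set: greedy action selection is one cheap pass.
discrete_actions = np.linspace(-1.0, 1.0, 11)
greedy_discrete = max(discrete_actions, key=lambda a: q_value(state, a))

# Continuous action space: picking the greedy action now means running a
# numerical optimisation over actions at every single time step, which is
# the cost the excerpt describes as prohibitive.
res = minimize(lambda a: -q_value(state, a), x0=np.zeros(1))
greedy_continuous = res.x[0]
```

Actor-critic methods sidestep this per-step optimisation by maintaining an explicit policy π : S → A alongside the critic.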
“…This allows the optimization of the system by searching for the optimal policy with gradient descent methods. This architecture has been extended to more complex scenarios in continuous time; a review of different actor-critic schemes is reported in [137].…”
Section: Actor-Critic Structure (mentioning)
confidence: 99%
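For reference, the gradient-based policy search that excerpt describes is most often written as a temporal-difference actor-critic update of the following generic form (standard notation; not a formula quoted from [137]):

```latex
\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad
w \leftarrow w + \alpha_c\, \delta_t\, \nabla_w V_w(s_t), \qquad
\theta \leftarrow \theta + \alpha_a\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

The critic parameters w follow the TD error δ_t, and the same scalar δ_t drives the gradient step on the actor parameters θ.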