“…6.3) into sequences of simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent HRL organizes potentially deep NN-based RL sub-modules into self-organizing, two-dimensional motor control maps (Ring et al., 2011) inspired by neurophysiological findings (Graziano, 2009). … In policy gradient methods (Williams, 1986, 1988, 1992a; Sutton et al., 1999a; Baxter and Bartlett, 2001; Aberdeen, 2003; Ghavamzadeh and Mahadevan, 2003; Kohl and Stone, 2004; Wierstra et al., 2008; Rückstieß et al., 2008; Peters and Schaal, 2008b,a; Sehnke et al., 2010; Grüttner et al., 2010; Wierstra et al., 2010; Peters, 2010; Grondman et al., 2012; Heess et al., 2012), gradients of the total reward with respect to policies (NN weights) are estimated (and then exploited) through repeated NN evaluations.…”
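To make "estimated through repeated evaluations" concrete, here is a minimal sketch of the likelihood-ratio (REINFORCE-style) gradient estimator associated with Williams (1992a). It is an illustration under loudly stated assumptions, not the method of any cited paper: the environment is a hypothetical two-armed bandit with made-up arm means, the "network" is reduced to a vector of softmax logits `theta`, and the names `softmax` and `reinforce_bandit`, the learning rates, the noise level, and the running-average baseline are all choices made for this example. Each step samples an action from the current policy, observes a reward, and nudges the weights along `(reward - baseline) * grad log pi(action)`, an unbiased estimate of the gradient of expected reward.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, baseline_rate=0.05, seed=0):
    """Hypothetical toy: REINFORCE on a bandit with Gaussian reward noise.

    The policy parameters `theta` stand in for NN weights; each loop
    iteration is one policy evaluation used to form a gradient sample.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)  # one logit per arm
    baseline = 0.0  # running average of reward (variance reduction)
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current stochastic policy.
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = arm_means[action] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        # For a softmax policy, d/d theta_i log pi(a) = 1[i == a] - pi(i).
        for i in range(len(theta)):
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * advantage * grad_log
        # Update the baseline after the step so it never depends on `action`.
        baseline += baseline_rate * (reward - baseline)
    return softmax(theta)


if __name__ == "__main__":
    final_probs = reinforce_bandit([0.2, 0.8])
    print("final action probabilities:", final_probs)
```

Running it should shift almost all probability mass onto the arm with the higher mean reward. The baseline is subtracted only to reduce the variance of the gradient samples; because it does not depend on the sampled action, it leaves the expected gradient unchanged.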