“…Policy gradient methods are a notable exception to this statement. Starting with the pioneering work of Gullapalli and colleagues (Benbrahim & Franklin, 1997; Gullapalli, Franklin, & Benbrahim, 1994) in the early 1990s, these methods have been applied to a variety of robot learning problems, ranging from simple control tasks (e.g., balancing a ball on a beam (Benbrahim, Doleac, Franklin, & Selfridge, 1992) and pole balancing (Kimura & Kobayashi, 1998)) to complex learning tasks involving many degrees of freedom, such as learning of complex motor skills (Gullapalli et al., 1994; Mitsunaga, Smith, Kanda, Ishiguro, & Hagita, 2005; Miyamoto et al., 1995, 1996; Peters & Schaal, 2006; Peters, Vijayakumar, & Schaal, 2005a) and locomotion (Endo, Morimoto, Matsubara, Nakanishi, & Cheng, 2005; Kimura & Kobayashi, 1997; Kohl & Stone, 2004; Mori, Nakamura, Sato, & Ishii, 2004; Sato, Nakamura, & Ishii, 2002; Tedrake, Zhang, & Seung, 2005).…”