Evolution Strategies for Direct Policy Search
2008 · DOI: 10.1007/978-3-540-87700-4_43

Cited by 38 publications (31 citation statements) · References 14 publications
“…Each of these three papers compares performance on only a single simple task with a few settings: mountain car with and without observation noise for fixed and random start states [25], pole balancing with no noise and a random start state [24], and double pole balancing with no noise and a fixed start state [23]. This article differs not only in terms of methods compared, but also because we consider more settings (such as evaluating multiple levels of effector noise), perform tests on the significantly more complex task of keepaway, and form domain-independent conclusions about the two classes of methods considered.…”
Section: Related Work
confidence: 99%
“…In this study, we consider the covariance matrix adaptation evolution strategy (CMA-ES, [11,8,27]) for direct policy search, which gives striking results on RL benchmark problems [15,7,14,12]. The CMA-ES adapts the policy as well as parameters of its own search strategy (such as a variable metric) based on a ranking of the policies.…”
Section: Introduction
confidence: 99%
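
To make that mechanism concrete, below is a minimal sketch of direct policy search with CMA-ES, written against the `cma` Python package. The `episode_return` rollout is a hypothetical stand-in (a toy quadratic here) rather than a real environment, and the sketch does not reproduce the cited authors' exact setup. Note that CMA-ES uses only the ranking of the sampled policies' returns, not their magnitudes, which is one reason it is argued to be robust to noisy or rescaled returns.

import numpy as np
import cma  # pip install cma

def episode_return(theta):
    # Hypothetical stand-in: roll out the policy parameterized by
    # theta in the environment and return the (average) episodic return.
    return -float(np.sum((theta - 1.0) ** 2))  # toy objective, optimum at 1.0

# Initial mean of the search distribution (10 policy parameters)
# and initial global step size.
es = cma.CMAEvolutionStrategy(10 * [0.0], 0.5)
while not es.stop():
    thetas = es.ask()                              # sample candidate policies
    losses = [-episode_return(t) for t in thetas]  # CMA-ES minimizes; only ranks matter
    es.tell(thetas, losses)                        # rank-based update of mean,
                                                   # step size, and covariance matrix
best_theta = es.result.xbest                       # best policy parameters found
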
“…Evolution strategies have proven to be powerful methods for reinforcement learning (e.g., see [15,19,24,7,14,12]). It has been argued that they are more robust against noise.…”
Section: Introduction
confidence: 99%
“…Recently, Heidrich-Meisner and Igel (2008a, 2008b, 2008c) performed a systematic comparison between the CMA-ES and policy gradient methods with variable metrics. They discuss similarities and differences between these related approaches.…”
Section: Reinforcement Learning
confidence: 99%
“…Although ESs can be applied to various kinds of machine learning problems, the most elaborate variants are specialized for real-valued parameter spaces. Exemplary applications include supervised learning of feed-forward and recurrent neural networks, direct policy search in reinforcement learning, and model selection for kernel machines (e.g., Mandischer 2002; Igel et al. 2001; Schneider et al. 2004; Igel 2003; Friedrichs and Igel 2005; Kassahun and Sommer 2005; Pellecchia et al. 2005; Mersch et al. 2007; Siebel and Sommer 2007; Heidrich-Meisner and Igel 2008a, 2008b, 2008c; Glasmachers and Igel 2008, see below).…”
Section: Introduction
confidence: 99%