2014
DOI: 10.1007/978-3-662-44848-9_31

Policy Search for Path Integral Control

Abstract: Path integral (PI) control defines a general class of control problems for which the optimal control computation is equivalent to an inference problem that can be solved by evaluating a path integral over state trajectories. However, this potential is mostly unused in real-world problems because of two main limitations: first, current approaches can typically only be applied to learn open-loop controllers, and second, current sampling procedures are inefficient and not scalable to high dimensional s…
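The abstract's core idea, computing an optimal control as an inference problem over sampled state trajectories, can be sketched as a cost-weighted average of noisy control rollouts. This is a minimal illustration of the general PI control update, not the paper's own algorithm; the function name, signature, and fixed temperature `lam` are assumptions for illustration.

```python
import numpy as np

def pi_control_update(noisy_controls, path_costs, lam=1.0):
    """Path-integral-style update (illustrative sketch): reweight sampled
    control perturbations by the exponentiated negative path cost, so
    low-cost trajectories dominate the averaged control estimate."""
    costs = np.asarray(path_costs, dtype=float)
    # Subtract the minimum cost before exponentiating for numerical stability.
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    # Optimal-control estimate: cost-weighted average of the sampled controls.
    return np.tensordot(w, np.asarray(noisy_controls), axes=1)
```

With a small temperature, the estimate collapses onto the lowest-cost rollout; with equal costs, it reduces to a plain average, which matches the intuition that sampling efficiency hinges on how concentrated these weights become in high dimensions.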

Cited by 30 publications (43 citation statements)
References 12 publications
“…11. Learning curves for REPS and DREPS with the reward function in (14). Policy updates were calculated after every 50 rollouts.…”
Section: B. Real Robot Multi-Modal Problem
Confidence: 99%
“…In the following section, we will detail how to build a clustered data structure for DREPS, followed by the algorithm's derivation, which is done similarly to REPS and other information-theoretic Policy Search approaches [14].…”
Section: Introduction
Confidence: 99%
“…The unique traits of the LSOC framework have been exploited to derive a class of so-called PIC methods. The interested reader is referred to earlier references [26, 27, 28, 30, 32, 41, 43, 44, 45, 46, 47]. An overview of applications was already given in the introduction.…”
Section: Path Integral Control
Confidence: 99%
“…Since a temporal logic reward (described in the next section) depends on the entire trajectory, it doesn't have the notion of cost-to-go and can only be evaluated as a terminal reward. Therefore p(τ_i) (written short as p_i) is computed once and used for the updates of all θ_t (a similar approach is used in episodic PI-REPS [4]). The resulting update equations are…”
Section: B. Relative Entropy Policy Search
Confidence: 99%
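The episodic scheme described in that citation, computing one set of trajectory weights p_i from whole-trajectory returns and reusing it for every time step's parameter update, can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and the temperature `eta` is fixed here, whereas REPS actually obtains it by minimizing a dual function under a KL constraint.

```python
import numpy as np

def reps_weights(returns, eta=1.0):
    """Episodic REPS-style trajectory weights, p_i ∝ exp(R_i / eta).
    Illustrative sketch with a fixed temperature eta; in REPS proper,
    eta comes from solving the dual of the KL-constrained problem."""
    r = np.asarray(returns, dtype=float)
    # Subtract the max return before exponentiating for numerical stability.
    w = np.exp((r - r.max()) / eta)
    return w / w.sum()

def weighted_mean_update(thetas, weights):
    """Reuse a single set of trajectory weights for the parameter update
    at every time step, as in the episodic setting quoted above."""
    return np.tensordot(weights, np.asarray(thetas), axes=1)
```

Because the terminal (temporal-logic) reward only scores the full trajectory, the weights are computed once per batch of rollouts and then applied uniformly when updating each θ_t.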