Robotics: Science and Systems X 2014
DOI: 10.15607/rss.2014.x.052

Combining the benefits of function approximation and trajectory optimization

Abstract: Neural networks have recently solved many hard problems in Machine Learning, but their impact in control remains limited. Trajectory optimization has recently solved many hard problems in robotic control, but using it online remains challenging. Here we leverage the high-fidelity solutions obtained by trajectory optimization to speed up the training of neural network controllers. The two learning problems are coupled using the Alternating Direction Method of Multipliers (ADMM). This coupling enables t…
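To make the coupling concrete, here is a rough, self-contained sketch of an ADMM-style alternation on a toy 1-D double integrator, with a linear feedback law standing in for the neural network. The toy dynamics, cost, penalty weight, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D double integrator: state x = (position, velocity), control u = force.
dt, T = 0.1, 20

def rollout(u_seq, x0=np.zeros(2)):
    """Simulate the toy dynamics, returning the state visited at each step."""
    xs, x = [], x0
    for u in u_seq:
        xs.append(x)
        x = np.array([x[0] + dt * x[1], x[1] + dt * u])
    return np.array(xs)

def task_cost(u_seq):
    """Reach position 1 with small controls (stand-in for the task cost)."""
    xs = rollout(u_seq)
    return np.sum((xs[:, 0] - 1.0) ** 2) + 1e-2 * np.sum(u_seq ** 2)

def policy(theta, xs):
    """Linear feedback 'policy' u = theta[:2] . x + theta[2] (network stand-in)."""
    return xs @ theta[:2] + theta[2]

rho = 10.0
u, lam, theta = np.zeros(T), np.zeros(T), np.zeros(3)
for _ in range(15):
    # 1) Trajectory step: task cost plus quadratic coupling to the current policy.
    obj = lambda v: task_cost(v) + 0.5 * rho * np.sum(
        (v - policy(theta, rollout(v)) + lam) ** 2)
    u = minimize(obj, u, method="L-BFGS-B").x
    # 2) Policy step: least-squares fit of the policy to the augmented targets.
    xs = rollout(u)
    A = np.hstack([xs, np.ones((T, 1))])
    theta = np.linalg.lstsq(A, u + lam, rcond=None)[0]
    # 3) Dual update: accumulate the remaining control/policy disagreement.
    lam += u - policy(theta, xs)
```

Step 1 pulls the open-loop controls toward what the current policy would produce, step 2 regresses the policy onto the resulting trajectory, and step 3 accumulates any remaining disagreement in the dual variables, so the two problems are driven toward agreement at convergence.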

Cited by 85 publications (88 citation statements)
References 33 publications (29 reference statements)
“…where the covariance matrix has diagonal entries corresponding to the typical disturbance that the respective state component may encounter. The sampling idea is conceptually similar to fitting the tangent space of the demonstrator policy instead of just the nominal control command [16]. Unfortunately, despite our efforts to extract samples from MPC that cover a large volume in state space, there is still a bias of the state distribution towards those states that are encountered by the optimal MPC policy.…”
Section: Sampling From An MPC Solution (mentioning)
confidence: 99%
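A minimal sketch of what such disturbance-based sampling around a nominal MPC rollout could look like, assuming hypothetical `mpc_control` and `rollout_step` callables and a vector of per-component disturbance scales (an illustration, not the cited implementation):

```python
import numpy as np

def sample_mpc_dataset(mpc_control, rollout_step, x0, T, disturbance_scales,
                       samples_per_state=8, seed=0):
    """Collect (state, control) pairs around a nominal MPC rollout.

    Each visited state is perturbed with zero-mean Gaussian noise whose
    diagonal covariance reflects the typical disturbance of each state
    component; the MPC is re-queried at the perturbed states so the data
    cover a neighbourhood of the nominal trajectory, not only the states
    the optimal MPC policy itself visits.
    """
    rng = np.random.default_rng(seed)
    cov = np.diag(np.asarray(disturbance_scales, dtype=float) ** 2)
    data, x = [], np.asarray(x0, dtype=float)
    for _ in range(T):
        u = mpc_control(x)                                   # nominal command
        data.append((x.copy(), u))
        for dx in rng.multivariate_normal(np.zeros_like(x), cov,
                                          size=samples_per_state):
            data.append((x + dx, mpc_control(x + dx)))       # perturbed queries
        x = rollout_step(x, u)                               # follow nominal path
    return data
```

Even with the perturbed queries, the collected states remain concentrated near the nominal trajectory, which is exactly the distribution bias the excerpt points out.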
“…Characteristic of most GPS approaches is that the trajectories and the global policy are jointly optimized in an alternating fashion, with a penalty on their discrepancies. A deterministic variant was proposed by Mordatch and Todorov (2014). They also noted a connection to directly solving the deterministic trajectory optimization problem in Equation (3.9).…”
Section: Trajectory-guided Policy Learning (mentioning)
confidence: 99%
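In generic notation (my own, not necessarily that of Equation (3.9) in the citing work), the alternating scheme can be read as a penalty relaxation of the joint problem

```latex
\min_{x_{1:T},\, u_{1:T},\, \theta} \;\; \sum_{t=1}^{T} \ell(x_t, u_t)
  \;+\; \frac{\rho}{2} \sum_{t=1}^{T} \left\lVert u_t - \pi_\theta(x_t) \right\rVert^{2}
  \qquad \text{s.t.} \quad x_{t+1} = f(x_t, u_t),
```

where letting rho grow enforces u_t = pi_theta(x_t) and recovers direct optimization of the trajectory cost over the policy parameters, which is the connection to the deterministic trajectory optimization problem noted above.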
“…The constraints and the objectives will ensure that the behaviors of the system over the time horizon will satisfy the properties of interest, while optimizing some key performance metrics. A common solution to this problem lies in training a policy that "mimics" the input-output map of the MPC [14], [20]. Instead, our approach is based on two new ideas.…”
Section: Introduction (mentioning)
confidence: 99%
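In its simplest form, a policy that "mimics" the input-output map of the MPC, as in [14], [20], is a supervised regressor from states to MPC commands. Below is a minimal sketch using ridge-regularized polynomial least squares as a stand-in for a neural network; the feature choice and function names are assumptions for illustration.

```python
import numpy as np

def quad_features(X):
    """[1, x, upper-triangular x_i * x_j] features for each row of X."""
    X = np.atleast_2d(X)
    iu = np.triu_indices(X.shape[1])
    quad = np.stack([X[:, i] * X[:, j] for i, j in zip(*iu)], axis=1)
    return np.hstack([np.ones((len(X), 1)), X, quad])

def fit_mpc_imitator(states, controls, reg=1e-6):
    """Ridge-regularized least squares from states to MPC commands,
    a simple stand-in for the neural-network policy in [14], [20]."""
    Phi, U = quad_features(states), np.asarray(controls)
    W = np.linalg.solve(Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ U)
    return lambda x: (quad_features(x) @ W)[0]   # fast surrogate for online MPC
```

Trained on pairs gathered as in the sampling sketch above, the returned function can be queried at new states as a fast surrogate for running the MPC online, at the cost of the distribution-shift issues discussed in the first excerpt.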