2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
DOI: 10.1109/iros.2016.7759596

Optimal control and inverse optimal control by distribution matching

Abstract: Optimal control is a powerful approach to achieve optimal behavior. However, it typically requires a manual specification of a cost function, which often contains several objectives, such as reaching goal positions at different time steps or energy efficiency. Manually trading off these objectives is often difficult and requires a high engineering effort. In this paper, we present a new approach to specify optimal behavior. We directly specify the desired behavior by a distribution over future states o…
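As a rough illustration of the idea described in the abstract, the hand-tuned cost function is replaced by a target distribution that the controlled system should match. A minimal way to write such an objective (in our own notation, not necessarily the paper's exact formulation) is:

```latex
% Illustrative distribution-matching objective (notation is ours):
% p_\pi(s_t) is the state distribution induced by the policy \pi at time t,
% q_t(s_t) is the user-specified desired distribution for that time step,
% and D is a divergence such as the Kullback-Leibler divergence.
\min_{\pi} \; \sum_{t} D\big( p_{\pi}(s_t) \,\|\, q_t(s_t) \big)
```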

Cited by 4 publications (5 citation statements, all classified as mentioning); citing publications span 2017 to 2021. References 13 publications.

Citation statements, ordered by relevance:
“…The trajectory distribution p(τ) can be considered as a special case of the feature distribution. Behavior cloning methods such as […, Englert et al., 2013] and inverse reinforcement learning methods such as [Arenz et al., 2016] use feature distributions.…”
Section: Trajectory Feature Distribution (mentioning; confidence: 99%)
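To make the relation in the excerpt above concrete: a feature map φ applied to trajectories induces a distribution over feature values, and choosing φ as the identity recovers the trajectory distribution itself. The notation below (φ, p_φ) is ours, added for illustration:

```latex
% Feature distribution induced by the trajectory distribution p(\tau)
% through a feature map \phi (illustrative notation):
p_{\phi}(f) = \int \delta\big(f - \phi(\tau)\big)\, p(\tau)\, \mathrm{d}\tau,
\qquad
\phi(\tau) = \tau \;\Rightarrow\; p_{\phi} = p.
```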
“…Employed by: maximum margin [Ng and Russell, 2000, …, 2009, Zucker et al., 2011]; maximum entropy […, Ramachandran and Amir, 2007, Choi and Kim, 2011b, Kitani et al., 2012, Shiarlis et al., 2016, Finn et al., 2016b]; other [Doerr et al., 2015, Arenz et al., 2016] … a nonlinear reward function. On the other hand, IRL with a reward function that is nonlinear in the features is more challenging than IRL with linear reward functions.…”
Section: Objectives (mentioning; confidence: 99%)
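The contrast drawn in this excerpt between linear and nonlinear reward functions can be written compactly as follows (φ(s, a) denotes a feature vector; the notation is ours, added for illustration):

```latex
% Linear vs. nonlinear reward parameterizations in IRL (illustrative notation):
r_{\mathrm{lin}}(s,a) = w^{\top}\phi(s,a),
\qquad
r_{\mathrm{nonlin}}(s,a) = f_{\theta}\big(\phi(s,a)\big)
\quad \text{with } f_{\theta} \text{ nonlinear, e.g.\ a neural network.}
```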
“…On one hand, it regularizes the policy optimization step, which is crucial for convergence of the overall minimax problem. On the other hand, it offers a tractable maximum-entropy SOC framework for dealing with nonlinear dynamics through iterative linearization [21], [23]. To summarize, for every iteration k, we iterate over the updates of the worst-case distribution and its respective optimal policy; for more details, see Algorithm 1.…”
Section: Problem Formulation (mentioning; confidence: 99%)
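The alternating scheme sketched in this excerpt can be illustrated with a short skeleton. This is only a structural sketch under the assumption that the cited Algorithm 1 alternates the two updates as described; the function names and update rules here are placeholders, not the authors' actual implementation:

```python
# Structural sketch of an alternating minimax scheme (placeholder functions):
# at every iteration k, first update the worst-case distribution for the
# current policy, then re-optimize the (entropy-regularized) policy against it.
def alternating_minimax(policy, distribution, update_worst_case, optimize_policy,
                        num_iterations=100):
    for k in range(num_iterations):
        # Inner maximization: worst-case distribution for the current policy.
        distribution = update_worst_case(policy, distribution)
        # Inner minimization: entropy-regularized policy optimization
        # against the updated worst-case distribution.
        policy = optimize_policy(distribution, policy)
    return policy, distribution
```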
“…Both areas of research are often formalized as distribution-matching, that is, the learned policy (or the optimal policy for IRL) should induce a distribution over states and actions that is close to the expert's distribution with respect to a given (usually non-metric) distance. Commonly applied distances are the forward Kullback-Leibler (KL) divergence (e.g., Ziebart, 2010), which maximizes the likelihood of the demonstrated state-action pairs under the agent's distribution, and the reverse Kullback-Leibler (RKL) divergence (e.g., Arenz et al., 2016; Fu et al., 2018; Ghasemipour et al., 2020), which minimizes the expected discrimination information (Kullback and Leibler, 1951) of state-action pairs sampled from the agent's distribution. However, since the emergence of generative adversarial networks (GANs; Goodfellow et al., 2014) as a solution technique for both areas, other divergences have been investigated, such as the Jensen-Shannon divergence (Ho and Ermon, 2016), the Wasserstein distance (Xiao et al., 2019) and general f-divergences (Ke et al., 2019; Ghasemipour et al., 2020).…”
Section: Introduction (mentioning; confidence: 99%)
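For reference, the two divergences named in this excerpt can be written explicitly; ρ_E and ρ_π below denote the expert's and the agent's state-action distributions (this notation is ours, added for illustration):

```latex
% Forward and reverse KL divergences for state-action distribution matching
% (\rho_E: expert distribution, \rho_\pi: agent distribution; illustrative notation).
\begin{align}
  \text{forward KL:} \quad
  D_{\mathrm{KL}}(\rho_E \,\|\, \rho_\pi)
    &= \mathbb{E}_{(s,a)\sim\rho_E}\!\left[\log\frac{\rho_E(s,a)}{\rho_\pi(s,a)}\right],\\
  \text{reverse KL:} \quad
  D_{\mathrm{KL}}(\rho_\pi \,\|\, \rho_E)
    &= \mathbb{E}_{(s,a)\sim\rho_\pi}\!\left[\log\frac{\rho_\pi(s,a)}{\rho_E(s,a)}\right].
\end{align}
```

Minimizing the forward KL over π is equivalent to maximizing the expected log-likelihood of expert state-action pairs under the agent's distribution, which matches the description above; the reverse KL instead takes the expectation under the agent's own distribution.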