2022
DOI: 10.48550/arxiv.2203.02214
Preprint

Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization

Abstract: Recent progress in state-only imitation learning extends the scope of applicability of imitation learning to real-world settings by relieving the need for observing expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans to the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly dec…
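Reading the truncated abstract together with the title, the decoupling appears to separate planning toward a target state from executing the action that reaches it. The skeleton below is a minimal, hedged Python sketch of such a two-stage policy; every class, layer size, and method name is an illustrative assumption, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DecoupledPolicy(nn.Module):
    """Illustrative two-stage policy: plan a target state, then pick the action.

    This mirrors the decoupling the abstract hints at, but every name and
    shape here is an assumption made for exposition, not the paper's code.
    """
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        # High-level planner: proposes the desired next state from the current one.
        self.planner = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim)
        )
        # Low-level skill module: infers the action that reaches the planned state.
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim)
        )

    def forward(self, obs):
        target = self.planner(obs)                                    # plan your target
        action = self.inverse_dynamics(torch.cat([obs, target], -1))  # learn your skills
        return action, target
```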

Cited by 1 publication (3 citation statements)
References 7 publications
“…Each trajectory $\tau^e$ is composed of a sequence of observations $\tau^e = \{o^e_t\}_{t=1}^{T}$. A prevalent approach to address the ILfO problem involves transforming it into a Reinforcement Learning (RL) problem by defining proxy rewards based on the agent's trajectory $\tau$ and the expert demonstrations: $\{r_t\}_{t=1}^{T-1} := f_r(\tau, D^e)$, where $f_r$ represents a criterion for reward assignment (Torabi et al., 2018b; Yang et al., 2019; Lee et al., 2021; Jaegle et al., 2021; Liu et al., 2022; Huang et al., 2023). Subsequently, RL is employed to maximize the expected discounted sum of rewards:…”
Section: Imitation Through Proxy Rewards
confidence: 99%
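The statement above describes the standard proxy-reward recipe for ILfO. Below is a minimal Python sketch of that recipe, assuming a simple nearest-neighbor state-matching criterion for $f_r$ (the cited works use other criteria, e.g. learned discriminators), followed by the discounted return that the downstream RL step maximizes in expectation. Function names are illustrative, not from any cited implementation.

```python
import numpy as np

def proxy_rewards(agent_obs, expert_obs):
    """Assign proxy rewards r_1..r_{T-1} to an agent trajectory by comparing
    each visited observation against the pooled expert observations.

    Illustrative choice of f_r: negative distance to the nearest expert
    observation. The cited methods define f_r differently.
    """
    agent_obs = np.asarray(agent_obs)    # shape (T, obs_dim), agent trajectory tau
    expert_obs = np.asarray(expert_obs)  # shape (N, obs_dim), pooled from D^e
    rewards = []
    for o_t in agent_obs[:-1]:           # rewards are defined for t = 1..T-1
        dists = np.linalg.norm(expert_obs - o_t, axis=1)
        rewards.append(-float(dists.min()))
    return rewards

def discounted_return(rewards, gamma=0.99):
    """The quantity whose expectation the RL step then maximizes."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```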
“…Some approaches train an inverse dynamics model on the agent's collected data, and use this model to infer the expert's missing action information (Nair et al., 2017; Torabi et al., 2018a; Pathak et al., 2018; Radosavovic et al., 2021). Recent work also integrates the inverse dynamics model with proxy-reward-based algorithms (Liu et al., 2022; Ramos et al., 2023). Taking a different approach, Edwards et al. (2019) learn a forward dynamics model on a latent action space.…”
Section: Related Work
confidence: 99%
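The related-work statement above summarizes the inverse-dynamics route to ILfO. The sketch below illustrates that idea under stated assumptions: a small PyTorch model trained on the agent's own (o_t, a_t, o_{t+1}) transitions, then used to label consecutive expert observations with inferred actions. Class and function names are hypothetical, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the action that took the agent from o_t to o_{t+1}.

    Trained on (o_t, a_t, o_{t+1}) tuples collected by the agent itself;
    hidden size and architecture are illustrative assumptions.
    """
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, next_obs):
        return self.net(torch.cat([obs, next_obs], dim=-1))

def infer_expert_actions(idm, expert_obs):
    """Label consecutive expert observations with inferred actions, yielding
    (o^e_t, a_hat_t) pairs that a standard behavioral-cloning loss can consume."""
    with torch.no_grad():
        return idm(expert_obs[:-1], expert_obs[1:])
```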