2021
DOI: 10.48550/arxiv.2109.04307
Preprint

OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching

Abstract: Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. This limits IRL applications in the real world, where environment interactions can become highly expensive. To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts off-policy data distribution instead of on-policy…
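To make the abstract's off-policy ingredient concrete, here is a minimal sketch of a generic off-policy adversarial IRL update, in which the discriminator is trained on expert transitions versus transitions sampled from a replay buffer rather than fresh on-policy rollouts. This is an illustrative assumption, not the OPIRL reference implementation; the function and tensor names (disc, disc_opt, expert_batch, replay_batch) are hypothetical.

import torch
import torch.nn.functional as F

def discriminator_step(disc, disc_opt, expert_batch, replay_batch):
    """One binary-classification update: expert -> 1, replay (policy) -> 0."""
    exp_obs, exp_act = expert_batch    # tensors shaped [B, obs_dim], [B, act_dim]
    rep_obs, rep_act = replay_batch    # sampled from the off-policy replay buffer

    expert_logits = disc(exp_obs, exp_act)
    replay_logits = disc(rep_obs, rep_act)

    # Distribution matching via binary classification: push expert pairs
    # toward label 1 and replay (policy) pairs toward label 0.
    loss = (
        F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
        + F.binary_cross_entropy_with_logits(replay_logits, torch.zeros_like(replay_logits))
    )

    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return loss.item()

Because the negative samples come from a replay buffer rather than freshly collected rollouts, the discriminator update needs no new environment interaction per step, which is the sample-efficiency argument the abstract makes.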

Cited by 1 publication (1 citation statement)
References 24 publications (34 reference statements)
“…As a result, the off-policy update of the adversarial training structure would be less stable. Worse, this off-policy regime is likely to overfit the training data, leading to severe training instability or even failures of imitation, as shown in Rafailov et al. (2021) and Hoshino et al. (2021). In OPIfVI, to enhance training stability against the drawbacks of off-policy learning, we employ spectral normalization (Miyato et al., 2018; Cheng et al., 2021) to force the discriminator to be locally Lipschitz-continuous.…”
Section: Off-Policy Learning
confidence: 99%
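Since the quoted passage credits spectral normalization of the discriminator for the added stability, a minimal PyTorch sketch of that idea follows. It assumes a simple GAIL-style (state, action) discriminator; the class name, layer sizes, and dimensions are illustrative, and this is not the OPIfVI or OPIRL code.

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Discriminator(nn.Module):
    """Classifies (state, action) pairs as expert versus policy-generated."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        # spectral_norm divides each weight matrix by an estimate of its
        # largest singular value (via power iteration), so every linear layer
        # is at most 1-Lipschitz and the network as a whole stays locally
        # Lipschitz-continuous (Miyato et al., 2018).
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(obs_dim + act_dim, hidden)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden, hidden)),
            nn.ReLU(),
            spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Returns a logit; sigmoid(logit) is the estimated probability of "expert".
        return self.net(torch.cat([obs, act], dim=-1))

Bounding the discriminator's Lipschitz constant limits how sharply its output, and hence the learned reward signal, can change between nearby off-policy samples, which is the stabilization mechanism the citing authors describe.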