2019
DOI: 10.48550/arxiv.1905.11108
Preprint

SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards

Abstract: Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL)…
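The abstract is truncated above, but the title together with the citation statements below conveys the core mechanism: demonstration transitions are given a constant reward of +1, the agent's own experience a reward of 0, and a standard off-policy (soft) Q-learning update is run on a balanced mix of both. The following is a minimal sketch of that reward relabeling, with illustrative names (relabel_rewards, sample_training_batch) that are not taken from the paper:

```python
import random

def relabel_rewards(demo_transitions, agent_transitions):
    """SQIL-style constant-reward relabeling (a sketch, names illustrative):
    demonstration transitions get r = +1, the agent's own transitions get r = 0."""
    demo_buffer = [(s, a, 1.0, s_next) for (s, a, s_next) in demo_transitions]
    agent_buffer = [(s, a, 0.0, s_next) for (s, a, s_next) in agent_transitions]
    return demo_buffer, agent_buffer

def sample_training_batch(demo_buffer, agent_buffer, batch_size=64):
    """Mix demonstration and agent experience in equal proportion; the resulting
    batch would then be fed to an ordinary off-policy (soft) Q-learning update."""
    half = batch_size // 2
    batch = random.sample(demo_buffer, min(half, len(demo_buffer)))
    batch += random.sample(agent_buffer, min(half, len(agent_buffer)))
    random.shuffle(batch)
    return batch
```

This is also the idea that the D2-Imitation ablation quoted below refers to as the "without-discriminator" variant.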

Cited by 37 publications (52 citation statements)
References 15 publications

“…We conduct experiments on MineRL environment [Guss et al, 2019]. Our approach is built based on existing RL algorithms including SQIL [Reddy et al, 2019], PPO [Schulman et al, 2017], DQfD [Hester et al, 2018]. We present the details of our approach as well as the experiment settings in Appendix 5.…”
Section: Methods
Mentioning confidence: 99%
“…Baselines. We compare our approach with several baselines in two categories: 1) end-to-end learning methods used in prior work and in the official implementation provided, including BC [Kanervisto et al, 2020], SQIL [Reddy et al, 2019], Rainbow [Hessel et al, 2018], DQfD [Hester et al, 2018], PDDDQN [Schaul et al, 2015; Van Hasselt et al, 2016; Wang et al, 2016]; all these baselines are trained within 8 million samples with default hyper-parameters.…”
Section: Methods
Mentioning confidence: 99%
“…The discriminator is crucial for D2-Imitation to work properly and guarantees convergence to expert performance. We perform ablations on the discriminator and compare the training performance of D2-Imitation with one variant: one without any discriminator (denoted without-discriminator), which just puts on-policy samples into B_0, assigns them 0 reward, and gives +1 reward to demonstration samples in B_+, an idea adopted in Soft-Q Imitation Learning (SQIL) (Reddy, Dragan, and Levine 2019).…”
Section: Ablation Experiments
Mentioning confidence: 99%
“…We call this Supervised Negative Q-learning (SNQN). Another interpretation of negative sampling in RL is imitation learning under sparse reward settings [23]. Different from SQN, which only performs RL on positive actions (clicks, views, etc.…”
Section: Introduction
Mentioning confidence: 99%