Deep Q-learning from Demonstrations
2017 · Preprint
DOI: 10.48550/arxiv.1704.03732

Abstract: Deep reinforcement learning (RL) has achieved several high-profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may ac…

Cited by 32 publications (50 citation statements)
References 3 publications (3 reference statements)

“…Reinforcement learning with demonstrations can exploit the strengths of RL and IL and overcome their respective weaknesses, leading to wide usage in complex robotic control tasks. In such frameworks, demonstrations mainly function as an initialization tool to boost RL policies, with the subsequent exploration process expected to find a better policy than that of the supervisor [20], [22], [23]. Aside from bootstrapping exploration in RL, demonstrations can also be used to infer the reward function [24]; this belongs to the branch of inverse reinforcement learning (IRL) and is not covered in this study.…”
Section: Related Work
confidence: 99%
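
The statement above treats demonstrations as an initialization tool that boosts an RL policy before exploration takes over. A minimal sketch of that idea, as behavior-cloning pre-training of a discrete-action policy in PyTorch, is given below; the network size, the training loop length, and the `demo_states`/`demo_actions` arrays are illustrative assumptions rather than anything specified in the cited works.

```python
# Sketch: pre-train a policy on demonstrations (behavior cloning) before handing
# it to an RL loop whose exploration is expected to improve on the demonstrator.
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4  # assumed environment dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Hypothetical demonstration data: observed states and the expert's discrete actions.
demo_states = torch.randn(256, obs_dim)
demo_actions = torch.randint(0, n_actions, (256,))

# Supervised pre-training: fit the policy to the demonstrated actions.
for _ in range(100):
    logits = policy(demo_states)
    loss = nn.functional.cross_entropy(logits, demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# `policy` would now seed the RL stage, which explores to surpass the supervisor.
```
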
“…For tasks with sparse rewards (such as the insertion task considered in this paper), RL algorithms (e.g., DDPG [13]) converge slowly and are not data-efficient because non-zero rewards are difficult to discover through exploration. To address this problem, demonstration data can be added to the replay buffer [14,15,16,17]. However, these algorithms either need a large amount of demonstration data to balance the data distribution, or may still diverge because exploration remains difficult.…”
Section: Related Work
confidence: 99%
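
The fix described above, adding demonstration data into the replay buffer, can be sketched generically as follows. This is an assumed illustration in the spirit of the cited works [14,15,16,17], not a specific library API: the class name, the permanently kept demonstration partition, and the `demo_fraction` parameter are all hypothetical.

```python
# Sketch: a replay buffer seeded with demonstration transitions that are never
# overwritten, so sparse-reward training can always sample informative data.
import random
from collections import deque

class DemoReplayBuffer:
    def __init__(self, capacity, demo_transitions):
        self.demo = list(demo_transitions)   # demonstration data, kept permanently
        self.agent = deque(maxlen=capacity)  # agent data, overwritten FIFO

    def add(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size, demo_fraction=0.25):
        n_demo = int(batch_size * demo_fraction)
        batch = random.sample(self.demo, min(n_demo, len(self.demo)))
        batch += random.sample(self.agent, min(batch_size - len(batch), len(self.agent)))
        return batch

# Usage: seed with expert (s, a, r, s_next, done) tuples, then mix during training.
buffer = DemoReplayBuffer(capacity=100_000, demo_transitions=[(0, 1, 0.0, 1, False)])
buffer.add((1, 0, 1.0, 2, True))
print(buffer.sample(batch_size=2))
```
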
“…The HAT algorithm (Taylor et al., 2011) introduces an intermediate policy-summarization step, in which the demonstrated data is translated into an approximate policy that is then used to bias exploration in a final RL stage. In Hester et al. (2017), the policy is trained simultaneously on expert data and collected data, using a combination of supervised and temporal-difference losses. In Salimans & Chen (2018), the RL agent is reset at the start of each episode to a state from the single demonstration.…”
Section: Demonstration- and Plan-based Reward Shaping
confidence: 99%
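
The combination of supervised and temporal-difference losses attributed to Hester et al. (2017) above can be sketched as a single loss over mixed expert and self-collected batches. The margin value, the loss weight, and the tensor shapes below are assumptions, and the full method also includes an n-step TD term and L2 regularization, which this sketch omits.

```python
# Sketch: one-step TD loss on all samples plus a large-margin supervised loss
# applied only to demonstration samples, in the spirit of DQfD.
import torch

def dqfd_style_loss(q_net, target_net, batch, gamma=0.99, margin=0.8, lambda_e=1.0):
    # batch: states [B, obs_dim], actions [B] (long), rewards [B],
    # next_states [B, obs_dim], dones [B] (float), is_demo [B] (float mask)
    s, a, r, s2, done, is_demo = batch

    q_all = q_net(s)                                   # Q(s, ·) for every action
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a_taken)

    # One-step TD target, double-Q style: online net picks a', target net evaluates it.
    with torch.no_grad():
        a2 = q_net(s2).argmax(dim=1, keepdim=True)
        target = r + gamma * (1.0 - done) * target_net(s2).gather(1, a2).squeeze(1)
    td_loss = torch.nn.functional.smooth_l1_loss(q_sa, target)

    # Large-margin supervised loss: max_a [Q(s, a) + margin * 1(a != a_E)] - Q(s, a_E),
    # zeroed out on non-demonstration samples via the is_demo mask.
    margins = torch.full_like(q_all, margin)
    margins.scatter_(1, a.unsqueeze(1), 0.0)           # no margin at the demonstrated action
    sup_loss = (((q_all + margins).max(dim=1).values - q_sa) * is_demo).mean()

    return td_loss + lambda_e * sup_loss
```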