Data-efficient Deep Reinforcement Learning for Dexterous Manipulation

Popov, Ivaylo; Heess, Nicolas; Lillicrap, Timothy P.; Hafner, Roland; Barth-Maron, Gabriel; Vecerík, Matej; Lampe, Thomas; Tassa, Yuval; Erez, Tom; Riedmiller, Martin

doi:10.48550/arxiv.1704.03073

Cited by 49 publications

(67 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In RL, reward shaping is used to reshape the original reward function to better guide the direction of the gradient update [30]. Prior knowledge about the environment is needed to formalize a reliable reward shaping function to avoid otherwise to bias learning [39].…”

Section: Unbiased Reward Shapingmentioning

confidence: 99%

Reinforcement Online Learning to Rank with Unbiased Reward Shaping

Zhuang¹,

Qiao²,

Zuccon³

2022

Preprint

View full text Add to dashboard Cite

Online learning to rank (OLTR) aims to learn a ranker directly from implicit feedback derived from users' interactions, such as clicks. Clicks however are a biased signal: specifically, top-ranked documents are likely to attract more clicks than documents down the ranking (position bias). In this paper, we propose a novel learning algorithm for OLTR that uses reinforcement learning to optimize rankers: Reinforcement Online Learning to Rank (ROLTR). In ROLTR, the gradients of the ranker are estimated based on the rewards assigned to clicked and unclicked documents. In order to de-bias the users' position bias contained in the reward signals, we introduce unbiased reward shaping functions that exploit inverse propensity scoring for clicked and unclicked documents. The fact that our method can also model unclicked documents provides a further advantage in that less users interactions are required to effectively train a ranker, thus providing gains in efficiency. Empirical evaluation on standard OLTR datasets shows that ROLTR achieves state-ofthe-art performance, and provides significantly better user experience than other OLTR approaches. To facilitate the reproducibility of our experiments, we make all experiment code available at https://github.com/ielab/OLTR.

show abstract

Section: Unbiased Reward Shapingmentioning

confidence: 99%

Reinforcement Online Learning to Rank with Unbiased Reward Shaping

Zhuang¹,

Qiao²,

Zuccon³

2022

Preprint

View full text Add to dashboard Cite

show abstract

“…For example, every deadlock that may arise during the previously described optimization scheme should have been predicted, and a corresponding mitigation plan should have been already in place [Palacios-Gasós et al 2016]; otherwise, the robot is going to be stuck in this locally optimal configuration. On top of that, to engineer a multi-term strategy that reflects the task at hand is not always trivial [Popov et al 2017].…”

Section: Related Workmentioning

confidence: 99%

MarsExplorer: Exploration of Unknown Terrains via Deep Reinforcement Learning and Procedurally Generated Environments

Koutras¹,

Kapoutsis²,

Amanatiadis³

et al. 2021

Preprint

View full text Add to dashboard Cite

This paper is an initial endeavor to bridge the gap between powerful Deep Reinforcement Learning methodologies and the problem of exploration/coverage of unknown terrains. Within this scope, MarsExplorer, an openai-gym compatible environment tailored to exploration/coverage of unknown areas, is presented. MarsExplorer translates the original robotics problem into a Reinforcement Learning setup that various off-the-shelf algorithms can tackle. Any learned policy can be straightforwardly applied to a robotic platform without an elaborate simulation model of the robot's dynamics to apply a different learning/adaptation phase. One of its core features is the controllable multi-dimensional procedural generation of terrains, which is the key for producing policies with strong generalization capabilities. Four different stateof-the-art RL algorithms (A3C, PPO, Rainbow, and SAC) are trained on the MarsExplorer environment, and a proper evaluation of their results compared to the average humanlevel performance is reported. In the follow-up experimental analysis, the effect of the multi-dimensional difficulty setting on the learning capabilities of the best-performing algorithm (PPO) is analyzed. A milestone result is the generation of an exploration policy that follows the Hilbert curve without providing this information to the environment or rewarding directly or indirectly Hilbert-curve-like trajectories. The experimental analysis is concluded by comparing PPO learned policy results with frontier-based exploration context for extended terrain sizes. The source code can be found at: https://github.com/dimikout3/GeneralExplorationPolicy.

show abstract

“…Deep Reinforcement learning (D-RL) has been used to learn controllers for a variety of tasks ranging from walking robots [6], [7], [8] to manipulating objects with an arm [9], [10], [11], [12], [13]. Hence reinforcement learning, indeed, offers a way to realize peg-in-hole tasks via random explorations, thereby eliminating the need to hand craft an effective control/policy without using any form of expert data.…”

Section: Related Workmentioning

confidence: 99%

Imitation Learning for High Precision Peg-in-Hole Tasks

Gubbi,

Kolathaya,

Amrutur

2021

Preprint

View full text Add to dashboard Cite

Industrial robot manipulators are not able to match the precision and speed with which humans are able to execute contact rich tasks even to this day. Therefore, as a means overcome this gap, we demonstrate generative methods for imitating a peg-in-hole insertion task in a 6-DOF robot manipulator. In particular, generative adversarial imitation learning (GAIL) is used to successfully achieve this task with a 6 µm peg-hole clearance on the Yaskawa GP8 industrial robot. Experimental results show that the policy successfully learns within 20 episodes from a handful of human expert demonstrations on the robot (i.e., < 10 tele-operated robot demonstrations). The insertion time improves from > 20 seconds (which also includes failed insertions) to < 15 seconds, thereby validating the effectiveness of this approach.

show abstract

Data-efficient Deep Reinforcement Learning for Dexterous Manipulation

Cited by 49 publications

References 0 publications

Reinforcement Online Learning to Rank with Unbiased Reward Shaping

Reinforcement Online Learning to Rank with Unbiased Reward Shaping

MarsExplorer: Exploration of Unknown Terrains via Deep Reinforcement Learning and Procedurally Generated Environments

Imitation Learning for High Precision Peg-in-Hole Tasks

Contact Info

Product

Resources

About