2011
DOI: 10.1007/978-3-642-23780-5_11

Preference-Based Policy Learning

Abstract: Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates a simulator-free direct policy learning approach, called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy comparatively to other ones according to her preferences; these preferences are used to learn a policy return estimate; the robot uses the policy return estimate to build new candidate policies, and the process is iterated until the desired behaviour is obtained.
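The four-step loop described in the abstract lends itself to a compact sketch. The snippet below is a minimal, hypothetical illustration of that loop, not the authors' implementation: the linear policy parameterisation, the simulated expert ranking, the logistic preference-based return estimate and the random candidate search are all assumptions made for the example.

```python
# Minimal, hypothetical sketch of the PPL loop from the abstract.
# All components (linear policy features, logistic preference model,
# random search for new candidates) are illustrative assumptions,
# not the authors' actual algorithm.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # dimension of the (assumed) policy/behaviour descriptor

def demonstrate(policy_params):
    """Stand-in for running the policy on the robot and summarising
    the demonstrated behaviour as a feature vector."""
    return policy_params + 0.1 * rng.normal(size=DIM)

def fit_return_estimate(prefs, feats):
    """Learn a linear policy-return estimate w from pairwise preferences
    (i preferred over j) with a few steps of logistic-loss gradient descent."""
    w = np.zeros(DIM)
    for _ in range(200):
        grad = np.zeros(DIM)
        for i, j in prefs:
            d = feats[i] - feats[j]
            grad += -d / (1.0 + np.exp(w @ d))  # gradient of log(1 + exp(-w.d))
        w -= 0.05 * grad / max(len(prefs), 1)
    return w

archive, prefs = [], []          # demonstrated behaviours and expert rankings
policy = rng.normal(size=DIM)    # initial candidate policy (assumed parameterisation)

for it in range(10):
    behaviour = demonstrate(policy)            # 1. robot demonstrates the candidate
    archive.append(behaviour)
    if len(archive) > 1:
        # 2. expert ranks the new demonstration against the previous ones
        #    (a hidden target direction stands in for the human expert here)
        hidden = np.ones(DIM)
        k = len(archive) - 1
        for j in range(k):
            better, worse = (k, j) if hidden @ archive[k] > hidden @ archive[j] else (j, k)
            prefs.append((better, worse))
        # 3. learn a policy return estimate from the accumulated preferences
        w = fit_return_estimate(prefs, archive)
        # 4. build a new candidate policy by (assumed) random search on the estimate
        candidates = policy + 0.3 * rng.normal(size=(20, DIM))
        policy = candidates[np.argmax(candidates @ w)]
```

In the paper the ranking step is performed by a human expert; the hidden target direction above merely stands in for that feedback so the sketch can run end to end.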

Cited by 59 publications (60 citation statements, 2012–2020). References 16 publications.

Citation statements (ordered by relevance):
“…The idea of preference-based reinforcement learning was introduced simultaneously and independently in Akrour et al (2011) and Cheng et al (2011). While preferences on histories are taken as a point of departure in both approaches, policy learning is accomplished in different ways.…”
Section: Related Work
Confidence: 99%
“…The preference-based policy learning settings considered in Fürnkranz et al (2012), Akrour et al (2011) proceed from a (possibly partial) preference relation over histories h ∈ H^(T), and the goal is to find a policy which tends to generate preferred histories with high probability. In this regard, it is notable that, in the EDPS framework, the precise values of the function to be optimized (in this case the expected total rewards) are actually not used by the evolutionary optimizer.…”
Section: Ordinal Decision Models
Confidence: 99%
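The remark above, that the evolutionary optimizer never uses the precise values of the function being optimized, can be illustrated with a comparison-based search step. The sketch below is an assumption-laden toy, not EDPS itself: it uses a (1+λ)-style loop with a stand-in preference oracle, and only ever queries which of two candidates is preferred.

```python
# Illustrative sketch of comparison-based policy search: the optimiser
# below only ever asks "is candidate a preferred to candidate b?" and
# never sees a numeric return value. The preference oracle and the
# (1+lambda) scheme are assumptions for illustration, not EDPS.
import numpy as np

rng = np.random.default_rng(1)

def preferred(a, b):
    """Stand-in preference oracle over (histories generated by) two policies.
    Returns True if a is preferred to b; only ordinal information is exposed."""
    target = np.ones_like(a)
    return target @ a > target @ b

def comparison_based_search(x0, sigma=0.3, lam=8, iters=50):
    """(1+lambda)-style search driven purely by pairwise preferences."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        offspring = x + sigma * rng.normal(size=(lam, x.size))
        best = x
        for child in offspring:
            if preferred(child, best):   # ordinal comparison only
                best = child
        x = best
    return x

print(comparison_based_search(np.zeros(5)))
```

Because only the outcome of pairwise comparisons enters the update, any monotone transformation of the underlying return leaves the search trajectory unchanged, which is exactly why ordinal preference information suffices for this kind of optimizer.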