2011
DOI: 10.1007/978-3-642-23780-5_11

Preference-Based Policy Learning

Abstract: Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates a simulator-free direct policy learning approach, called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy comparatively to other ones according to her preferences; these preferences are used to learn a policy return estimate; the robot uses the policy return estimate to build new candidate policies, and the process is iterated until the desired behaviour is obtained.
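The four-step loop described in the abstract lends itself to a compact sketch. The snippet below is a minimal, hypothetical illustration of that loop, not the authors' implementation: the linear policy parameterisation, the simulated expert ranking, the logistic preference-based return estimate and the random candidate search are all assumptions made for the example.

```python
# Minimal, hypothetical sketch of the PPL loop from the abstract.
# All components (linear policy features, logistic preference model,
# random search for new candidates) are illustrative assumptions,
# not the authors' actual algorithm.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # dimension of the (assumed) policy/behaviour descriptor

def demonstrate(policy_params):
    """Stand-in for running the policy on the robot and summarising
    the demonstrated behaviour as a feature vector."""
    return policy_params + 0.1 * rng.normal(size=DIM)

def fit_return_estimate(prefs, feats):
    """Learn a linear policy-return estimate w from pairwise preferences
    (i preferred over j) with a few steps of logistic-loss gradient descent."""
    w = np.zeros(DIM)
    for _ in range(200):
        grad = np.zeros(DIM)
        for i, j in prefs:
            d = feats[i] - feats[j]
            grad += -d / (1.0 + np.exp(w @ d))  # gradient of log(1 + exp(-w.d))
        w -= 0.05 * grad / max(len(prefs), 1)
    return w

archive, prefs = [], []          # demonstrated behaviours and expert rankings
policy = rng.normal(size=DIM)    # initial candidate policy (assumed parameterisation)

for it in range(10):
    behaviour = demonstrate(policy)            # 1. robot demonstrates the candidate
    archive.append(behaviour)
    if len(archive) > 1:
        # 2. expert ranks the new demonstration against the previous ones
        #    (a hidden target direction stands in for the human expert here)
        hidden = np.ones(DIM)
        k = len(archive) - 1
        for j in range(k):
            better, worse = (k, j) if hidden @ archive[k] > hidden @ archive[j] else (j, k)
            prefs.append((better, worse))
        # 3. learn a policy return estimate from the accumulated preferences
        w = fit_return_estimate(prefs, archive)
        # 4. build a new candidate policy by (assumed) random search on the estimate
        candidates = policy + 0.3 * rng.normal(size=(20, DIM))
        policy = candidates[np.argmax(candidates @ w)]
```

In the paper the ranking step is performed by a human expert; the hidden target direction above merely stands in for that feedback so the sketch can run end to end.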

Cited by 59 publications (60 citation statements, 2012–2020). References 16 publications.

Citation statements (ordered by relevance):
“…The idea of preference-based reinforcement learning was introduced simultaneously and independently in Akrour et al (2011) and Cheng et al (2011). While preferences on histories are taken as a point of departure in both approaches, policy learning is accomplished in different ways.…”
Section: Related Work
Confidence: 99%
“…The preference-based policy learning settings considered in Fürnkranz et al (2012), Akrour et al (2011) proceed from a (possibly partial) preference relation over histories h ∈ H^(T), and the goal is to find a policy which tends to generate preferred histories with high probability. In this regard, it is notable that, in the EDPS framework, the precise values of the function to be optimized (in this case the expected total rewards) are actually not used by the evolutionary optimizer.…”
Section: Ordinal Decision Models
Confidence: 99%
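The remark above, that the evolutionary optimizer never uses the precise values of the function being optimized, can be illustrated with a comparison-based search step. The sketch below is an assumption-laden toy, not EDPS itself: it uses a (1+λ)-style loop with a stand-in preference oracle, and only ever queries which of two candidates is preferred.

```python
# Illustrative sketch of comparison-based policy search: the optimiser
# below only ever asks "is candidate a preferred to candidate b?" and
# never sees a numeric return value. The preference oracle and the
# (1+lambda) scheme are assumptions for illustration, not EDPS.
import numpy as np

rng = np.random.default_rng(1)

def preferred(a, b):
    """Stand-in preference oracle over (histories generated by) two policies.
    Returns True if a is preferred to b; only ordinal information is exposed."""
    target = np.ones_like(a)
    return target @ a > target @ b

def comparison_based_search(x0, sigma=0.3, lam=8, iters=50):
    """(1+lambda)-style search driven purely by pairwise preferences."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        offspring = x + sigma * rng.normal(size=(lam, x.size))
        best = x
        for child in offspring:
            if preferred(child, best):   # ordinal comparison only
                best = child
        x = best
    return x

print(comparison_based_search(np.zeros(5)))
```

Because only the outcome of pairwise comparisons enters the update, any monotone transformation of the underlying return leaves the search trajectory unchanged, which is exactly why ordinal preference information suffices for this kind of optimizer.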