Robotics: Science and Systems XIII 2017
DOI: 10.15607/rss.2017.xiii.053
Active Preference-Based Learning of Reward Functions

Abstract: Our goal is to efficiently learn reward functions encoding a human's preferences for how a dynamical system should act. There are two challenges with this. First, in many problems it is difficult for people to provide demonstrations of the desired system trajectory (like a high-DOF robot arm motion or an aggressive driving maneuver), or to even assign how much numerical reward an action or trajectory should get. We build on work in label ranking and propose to learn from preferences (or comparisons) i…
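To make the abstract's idea concrete, here is a minimal sketch (not the authors' code) of the standard preference-based setup: the reward is assumed linear in trajectory features, and the human's answer to a pairwise query is modeled with a Bradley-Terry/softmax likelihood used to reweight a sample-based posterior over the weights. The feature function, variable names, and trajectory layout below are illustrative assumptions.

```python
import numpy as np

# Hypothetical feature function: maps a trajectory (T x 2 array) to a small
# feature vector. The specific features are made up for illustration.
def features(trajectory):
    trajectory = np.asarray(trajectory)
    return np.array([
        trajectory[:, 0].mean(),          # e.g. average speed
        np.abs(trajectory[:, 1]).mean(),  # e.g. average lane deviation
    ])

def preference_likelihood(w, xi_a, xi_b, answer):
    """P(human's answer | weights w), Bradley-Terry / softmax style.

    answer = +1 means the human chose xi_a, -1 means xi_b.
    """
    reward_gap = w @ (features(xi_a) - features(xi_b))
    return 1.0 / (1.0 + np.exp(-answer * reward_gap))

def update_posterior(w_samples, weights, xi_a, xi_b, answer):
    """Reweight a sample-based posterior over w after one comparison."""
    weights = weights * np.array(
        [preference_likelihood(w, xi_a, xi_b, answer) for w in w_samples]
    )
    return weights / weights.sum()
```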

Cited by 220 publications (320 citation statements). References 15 publications.
“…In contrast to related POMDP formulations, the exploration-exploitation trade-off is not addressed yet and is encoded only by a linear combination of objectives in the reward function. By contrast, the weights of the reward function can also be found by having a human driver choose a preferred trajectory iteratively from a set of two candidate trajectories (109). This allows the vehicle to learn the reward function without a set of expert trajectories and predefined labels.…”
Section: Decision-Making for Autonomous… (www.annualreviews.org)
Citation type: mentioning
Confidence: 99%
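The iterative procedure described in this citation statement (a human repeatedly choosing between two candidate trajectories) can be sketched as a simple query loop. This is a hedged illustration, not the cited implementation: `propose_pair` and `ask_human` are hypothetical placeholders, and in the actual method the pair would be chosen actively (e.g. to maximally shrink the posterior) rather than arbitrarily.

```python
import numpy as np

def preference_update(w_samples, weights, phi_a, phi_b, answer):
    """One Bayesian reweighting step given feature vectors of both candidates."""
    gaps = w_samples @ (phi_a - phi_b)
    lik = 1.0 / (1.0 + np.exp(-answer * gaps))
    weights = weights * lik
    return weights / weights.sum()

def learn_weights(w_samples, propose_pair, ask_human, n_queries=20):
    """Iteratively show the human two candidates and update the posterior."""
    weights = np.full(len(w_samples), 1.0 / len(w_samples))
    for _ in range(n_queries):
        phi_a, phi_b = propose_pair(w_samples, weights)  # features of the two candidates
        answer = ask_human(phi_a, phi_b)                 # +1 prefers A, -1 prefers B
        weights = preference_update(w_samples, weights, phi_a, phi_b, answer)
    return (weights[:, None] * w_samples).sum(axis=0)   # posterior-mean weights
```

The point of the loop is exactly what the statement emphasizes: no expert demonstrations or predefined labels are needed, only binary choices.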
“…These demonstrations are used to generate queries, to which the human responds by selecting her preferred trajectory (bottom). learning (IRL) [1,35,46], where we learn a reward function directly from expert demonstrations of the task, and preference-based learning [17,39], where we learn a reward function by repeatedly asking a human to pick between two trajectories. While these methods have found some success, they still struggle in practice, especially in robotics.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
“…Note that we do not consider these a contribution of our work: we choose the simplest approximations that facilitate tractability. There are many methods for approximate inference of θ studied in the literature that could be used for the joint (θ, β) spaces as well, from Metropolis Hastings [16], [39], to acquiring an MLE only via importance sampling of the partition function [6] or via a Laplace approximation [40].…”
Section: B. Approximation
Citation type: mentioning
Confidence: 99%
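For context on the approximate-inference options named above, the following is a minimal random-walk Metropolis-Hastings sketch for drawing posterior samples of the reward parameters θ (or jointly (θ, β)) given preference data. The log-posterior callable, step size, and function names are assumptions for illustration, not the cited papers' implementations; the Laplace and importance-sampling alternatives are not shown.

```python
import numpy as np

def metropolis_hastings(log_posterior, theta0, n_samples=5000, step=0.05, seed=0):
    """Random-walk Metropolis-Hastings over reward parameters theta.

    `log_posterior(theta)` should return the unnormalized log posterior,
    e.g. a Gaussian log-prior plus the sum of log preference likelihoods.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    current = log_posterior(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        cand = log_posterior(proposal)
        if np.log(rng.uniform()) < cand - current:  # accept with prob min(1, ratio)
            theta, current = proposal, cand
        samples.append(theta.copy())
    return np.array(samples)
```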