2021
DOI: 10.1177/02783649211041652

Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences

Abstract: Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward…

Cited by 46 publications (26 citation statements)
References 32 publications
“…Prior work has explored learning from expert behaviour and preferences (Ibarz et al., 2018; Palan et al., 2019; Bıyık et al., 2022; Koppol et al., 2020), or other multi-modal data sources (Tung et al., 2018; Jeon et al., 2020). One motivation is that different data sources may provide complementary reward information (Koppol et al., 2020), decreasing ambiguity.…”
Section: Related Work (mentioning)
confidence: 99%
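The point about complementary reward information can be made concrete: demonstrations and preference labels both constrain the same reward parameters, so their likelihoods can be combined into a single objective. Below is a minimal sketch, not drawn from the cited papers, assuming a linear reward over trajectory features, a Boltzmann-rational model for demonstrations, and a Bradley-Terry model for preferences; the function names and data layout are illustrative.

```python
import numpy as np

def traj_return(w, features):
    # Linear reward w . phi(s), summed over the trajectory's feature rows.
    return float(np.sum(features @ w))

def demo_loglik(w, demo_feats, alt_feats_list):
    # Boltzmann-rational demonstration model: the demonstrated trajectory is
    # exponentially more likely than a set of sampled alternative trajectories.
    scores = np.array([traj_return(w, f) for f in [demo_feats] + alt_feats_list])
    m = scores.max()
    return scores[0] - (m + np.log(np.sum(np.exp(scores - m))))

def pref_loglik(w, preferred_feats, rejected_feats):
    # Bradley-Terry preference model over trajectory returns: log sigmoid(diff).
    diff = traj_return(w, preferred_feats) - traj_return(w, rejected_feats)
    return -np.logaddexp(0.0, -diff)

def joint_loglik(w, demos, prefs):
    # Both data sources constrain the same weights w, so their log-likelihoods
    # simply add; each source rules out different regions of reward space.
    return (sum(demo_loglik(w, d, alts) for d, alts in demos)
            + sum(pref_loglik(w, p, r) for p, r in prefs))
```

Maximizing joint_loglik over w would use both sources at once; with either source alone, more weight vectors typically remain equally plausible, which is the ambiguity the quoted work refers to.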
“…Hence it is natural to integrate preference and action demonstration via a joint IRL framework (Palan et al., 2019; Bıyık et al., 2020), with a nice insight that these two sources of information are complementary under the IRL framework: "demonstrations provide a high-level initialization of the human's overall reward functions, while preferences explore specific, fine-grained aspects of it" (Bıyık et al., 2020). Therefore they use demonstrations to initialize a reward distribution, and refine the reward function with preference queries (Palan et al., 2019; Bıyık et al., 2020). Ibarz et al. (2018) take a different approach to combining demonstration and preference information, by using human demonstrations to pre-train the agent.…”
Section: Learning From Human Preference (mentioning)
confidence: 99%
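The two-stage recipe in this quote (demonstrations initialize a reward distribution, preference queries refine it) can be illustrated with a simple particle-based sketch. This is an assumption-laden illustration rather than the actual algorithm of Palan et al. (2019) or Bıyık et al. (2020): it assumes a linear reward, a unit-norm weight prior, a MaxEnt-style weighting for demonstrations, and a Bradley-Terry likelihood for answered queries; init_from_demos and update_with_preference are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment_return(w, feats):
    # Linear reward summed over a trajectory segment's feature rows.
    return float(np.sum(feats @ w))

def init_from_demos(demo_feats_list, dim, n_particles=1000, beta=1.0):
    # Stage 1: demonstrations give a coarse initial distribution over reward
    # weights. Candidate weights that assign higher return to the demonstrated
    # trajectories receive higher weight (a MaxEnt-style surrogate).
    particles = rng.normal(size=(n_particles, dim))
    particles /= np.linalg.norm(particles, axis=1, keepdims=True)
    scores = np.array([beta * sum(segment_return(w, f) for f in demo_feats_list)
                       for w in particles])
    weights = np.exp(scores - scores.max())
    return particles, weights / weights.sum()

def update_with_preference(particles, weights, preferred, rejected, beta=1.0):
    # Stage 2: each answered preference query reweights the particles via a
    # Bradley-Terry likelihood, sharpening fine-grained aspects of the reward.
    diffs = np.array([beta * (segment_return(w, preferred) -
                              segment_return(w, rejected)) for w in particles])
    likelihood = 1.0 / (1.0 + np.exp(-diffs))
    weights = weights * likelihood
    return weights / weights.sum()
```

In the quoted work the preference queries themselves are chosen actively; here, update_with_preference only shows how an already-answered query would sharpen the weight distribution that the demonstrations initialized.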
“…Under this framework, it is possible to develop a unified learning paradigm that accepts multiple types of human guidance. We start to notice efforts towards this goal (Abel et al., 2017; Waytowich et al., 2018; Goecks et al., 2019; Woodward et al., 2020; Najar et al., 2020; Bıyık et al., 2020).…”
Section: A Unified Learning Framework (mentioning)
confidence: 99%
“…Influential recent research has focused on reward learning from preferences over pairs of fixed-length trajectory segments. Nearly all of this recent work assumes that human preferences arise probabilistically from only the sum of rewards over a segment, i.e., the segment's partial return [9-16]. That is, these works assume that people tend to prefer trajectory segments that yield greater rewards during the segment.…”
Section: Introduction (mentioning)
confidence: 99%
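The partial-return assumption this quote describes has a simple logistic form: the probability of preferring one fixed-length segment over another depends only on the difference in summed reward within the segments. A minimal sketch under that assumption (the per-step reward lists and the temperature beta are illustrative, not values from the cited works):

```python
import numpy as np

def preference_prob(rewards_a, rewards_b, beta=1.0):
    # P(segment A preferred over segment B) under the partial-return model:
    # a logistic function of the difference in summed per-step rewards.
    partial_return_a = np.sum(rewards_a)
    partial_return_b = np.sum(rewards_b)
    return 1.0 / (1.0 + np.exp(-beta * (partial_return_a - partial_return_b)))

# Example: segment A accumulates more reward during the segment, so it is
# preferred with probability > 0.5.
print(preference_prob([1.0, 0.5, 0.0], [0.2, 0.1, 0.3]))  # ~0.71
```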