2010
DOI: 10.1007/978-3-642-16952-6_49

Dynamic Reward Shaping: Training a Robot by Voice

Cited by 63 publications (63 citation statements)
References 8 publications
“…In this learning scenario, feedback can be restricted to express various intensities of approval and disapproval; such feedback is mapped to numeric "reward" that the agent uses to revise its behavior [2], [3], [8], [1], [9]. Compared to learning from demonstration, learning from human reward requires only a simple task-independent interface and may require less expertise and place less cognitive load on the trainer [10].…”
Section: A Learning From Human Rewards (mentioning)
confidence: 99%
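The excerpt above describes trainer feedback restricted to graded approval and disapproval that is then mapped to a numeric "reward" signal. The sketch below is a minimal illustration of such a mapping; the phrase list, reward magnitudes, and function name are assumptions for illustration only, not taken from the cited papers.

```python
# Illustrative sketch: map graded verbal approval/disapproval to numeric reward.
# The specific phrases and reward values below are assumed, not from the cited work.

FEEDBACK_TO_REWARD = {
    "very good": 2.0,   # strong approval
    "good": 1.0,        # mild approval
    "bad": -1.0,        # mild disapproval
    "very bad": -2.0,   # strong disapproval
}

def reward_from_utterance(utterance: str) -> float:
    """Return a numeric reward for a recognized feedback phrase (0.0 if unrecognized)."""
    return FEEDBACK_TO_REWARD.get(utterance.strip().lower(), 0.0)

if __name__ == "__main__":
    for phrase in ("Very good", "bad", "keep going"):
        print(phrase, "->", reward_from_utterance(phrase))
```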
“…Reward and punishment are frequently received in a social context, from another social agent. In recent years, this form of communication and its machine-learning analog, reinforcement learning, have been adapted to permit teaching of artificial agents by their human users [4,14,6,13,11,10]. In this form of teaching, which we call interactive shaping, a user observes an agent's behavior while generating human reward instances through varying interfaces (e.g., keyboard, mouse, or verbal feedback); each instance is received by the learning agent as a time-stamped numeric value and used to inform future behavioral choices.…”
Section: Introduction (mentioning)
confidence: 99%
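The interactive-shaping setup described above, in which each human reward instance arrives as a time-stamped numeric value that informs future behavior, can be illustrated with a minimal sketch. The data structure, credit-assignment rule, and class names below are assumptions for illustration, not the algorithm of any cited paper.

```python
import time
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class HumanReward:
    """A single human reward instance: a time-stamped numeric value."""
    value: float
    timestamp: float

class ShapedAgent:
    """Toy agent that credits each human reward to its most recent (state, action)."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.value = defaultdict(float)   # (state, action) -> accumulated human reward
        self.last_pair = None             # most recently executed (state, action)

    def act(self, state):
        # Choose the action with the highest accumulated human reward in this state.
        action = max(self.actions, key=lambda a: self.value[(state, a)])
        self.last_pair = (state, action)
        return action

    def receive(self, reward: HumanReward):
        # Credit the time-stamped reward to the most recently executed (state, action).
        if self.last_pair is not None:
            self.value[self.last_pair] += reward.value

agent = ShapedAgent(actions=["forward", "left", "right"])
agent.act(state="corridor")
agent.receive(HumanReward(value=1.0, timestamp=time.time()))
```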
“…Investigating the six previous projects that we know to have involved learning from positively and negatively valued human-generated reward [4,14,13,11,9,10] (including by email with corresponding authors), we identified a curious trend: all such projects have been much more myopic, i.e., using high discount rates, than is usual in RL. We hypothesized that a cause of this pattern is the general positivity of human reward.…”
Section: Introduction (mentioning)
confidence: 99%
“…Accordingly, other algorithms for learning from human reward [4,21,20,16,18,13] do not directly account for delay, do not model human reward explicitly, and are not fully myopic (i.e., they employ discount factors greater than 0).…”
Section: Background On TAMER (mentioning)
confidence: 99%
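The excerpt above contrasts fully myopic learning from human reward with algorithms that employ discount factors greater than 0. The sketch below shows, in a generic one-step temporal-difference update written purely for illustration (not the TAMER algorithm itself), how the discount factor controls whether only the immediate human reward matters or estimated future value is propagated back as well.

```python
def td_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.0):
    """One-step temporal-difference update (illustrative, not from the cited papers).

    With gamma == 0.0 the update is fully myopic: only the immediate (human) reward
    is used. With gamma > 0.0, estimated future value also enters the target.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    key = (state, action)
    q[key] = q.get(key, 0.0) + alpha * (target - q.get(key, 0.0))
    return q

# Myopic (gamma = 0.0) vs. non-myopic (gamma = 0.9) update on the same experience:
q_myopic = td_update({}, "s0", "a", reward=1.0, next_state="s1", actions=["a", "b"], gamma=0.0)
q_disc = td_update({}, "s0", "a", reward=1.0, next_state="s1", actions=["a", "b"], gamma=0.9)
```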
“…Though a few past projects have considered this problem of learning from human reward [4,21,20,16,18,13,9], only two of these implemented their solution for a robotic agent. In one such project [13], the agent learned partially in simulation and from hardcoded reward, demonstrations, and human reward.…”
Section: Introduction (mentioning)
confidence: 99%