Learning from Extrapolated Corrections

Zhang, Jason Y.; Dragan, Anca D.

doi:10.1109/icra.2019.8793554

Cited by 10 publications

(10 citation statements)

References 8 publications

“…The concept of learning a hidden reward function from a user is widely used in various human-robot interaction frameworks, such as learning from demonstrations (LfD) [4], [17], learning from corrections [18], [19] and learning from preferences [1], [3], [12], [13], [17].…”

Section: A. Related Work (mentioning)

confidence: 99%

Active Preference Learning using Maximum Regret

Wilde, Kulić, Smith (2020, preprint)

We study active preference learning as a framework for intuitively specifying the behaviour of autonomous robots. In active preference learning, a user chooses the preferred behaviour from a set of alternatives, from which the robot learns the user's preferences, modeled as a parameterized cost function. Previous approaches present users with alternatives that minimize the uncertainty over the parameters of the cost function. However, different parameters might lead to the same optimal behaviour; as a consequence the solution space is more structured than the parameter space. We exploit this by proposing a query selection that greedily reduces the maximum error ratio over the solution space. In simulations we demonstrate that the proposed approach outperforms other state of the art techniques in both learning efficiency and ease of queries for the user. Finally, we show that evaluating the learning based on the similarities of solutions instead of the similarities of weights allows for better predictions for different scenarios.

“…However, expert demonstrations (with or without noise) are often difficult to obtain in real-world tasks. More recently, researchers start focusing on learning with nonexpert feedback on the queries of the robot's behaviors, often in the forms of ratings (Daniel et al. 2014), comparisons (Dorsa Sadigh, Sastry, and Seshia 2017), or critiques (Cui and Niekum 2018; Zhang and Dragan 2019). All these prior works rely on an implicit assumption that the non-expert user maintains a correct understanding of the robot's domain dynamics.…”

Section: Related Work (mentioning)

confidence: 99%

What Is It You Really Want of Me? Generalized Reward Learning with Biased Beliefs about Domain Dynamics

Gong, Zhang (2020, AAAI)

Reward learning as a method for inferring human intent and preferences has been studied extensively. Prior approaches make an implicit assumption that the human maintains a correct belief about the robot's domain dynamics. However, this may not always hold since the human's belief may be biased, which can ultimately lead to a misguided estimation of the human's intent and preferences, which is often derived from human feedback on the robot's behaviors. In this paper, we remove this restrictive assumption by considering that the human may have an inaccurate understanding of the robot. We propose a method called Generalized Reward Learning with biased beliefs about domain dynamics (GeReL) to infer both the reward function and human's belief about the robot in a Bayesian setting based on human ratings. Due to the complex forms of the posteriors, we formulate it as a variational inference problem to infer the posteriors of the parameters that govern the reward function and human's belief about the robot simultaneously. We evaluate our method in a simulated domain and with a user study where the user has a bias based on the robot's appearances. The results show that our method can recover the true human preferences while subject to such biased beliefs, in contrast to prior approaches that could have misinterpreted them completely.

“…The dynamics settings and parameters follow the experiment in Section V-A, and the weight-feature cost function is set as (31). According to [17], for each of the human's corrections, we first utilize the trajectory deformation technique [18] to obtain the corresponding human intended trajectory. Specifically, given a correction $a_k$, the human intended trajectory, denoted as $\xi^{\theta_k} = \{x^{\theta_k}_{0:T+1}, \bar{u}^{\theta_k}_{0:T}\}$, can be solved by…”

Section: Comparison With Related Work (mentioning)

confidence: 99%

“…To handle the sparse corrections that a human user applies only at sparse time instances during the robot's motion, these methods apply the trajectory deformation technique [20] to interpret each single-time-step correction through a human intended trajectory, i.e., a deformed robot trajectory. Although achieving promising results, choosing the hyper-parameters in the trajectory deformation is challenging, which can affect the learning performance [18]. In addition, these methods have not provided any convergence guarantee of the learning process.…”

Section: Introduction (mentioning)

confidence: 99%

“…In fact, as we will demonstrate in Sections II and V-C, the closer the robot gets to the expected trajectory, the more difficult the choice of a proper correction magnitude will be, which can lead to learning inefficiency. Also, for POMDP-based methods, when one applies the trajectory deformation, the choice of hyperparameters will determine the shape of the human intended trajectory and thus finally affect the learning performance, as discussed in [18].…”

Section: Introduction (mentioning)

confidence: 99%

Learning from Human Directional Corrections

Jin, Murphey, Lu et al. (2020, preprint)

This paper proposes a technique which enables a robot to learn a control objective function incrementally from a human user's corrections. The human's corrections can be as simple as directional corrections (corrections that indicate the direction of a control change without indicating its magnitude) applied at some time instances during the robot's motion. We only assume that each of the human's corrections, regardless of its magnitude, points in a direction that improves the robot's current motion relative to an implicit objective function. The proposed method uses the direction of a correction to update the estimate of the objective function based on a cutting plane technique. We establish the theoretical results to show that this process of incremental correction and update guarantees convergence of the learned objective function to the implicit one. The method is validated by both simulations and two human-robot games, where human players teach a 2-link robot arm and a 6-DoF quadrotor system for motion planning in environments with obstacles.
