Active Learning of Reward Dynamics from Hierarchical Queries

Basu, Chandrayee; Bıyık, Erdem; He, Zhixun; Singhal, Mukesh; Sadigh, Dorsa

doi:10.1109/iros40897.2019.8968522

Cited by 23 publications

(19 citation statements)

References 26 publications


“…While having humans provide pairwise comparisons does not suffer from similar problems to collecting demonstrations, each comparison question is much less informative than a demonstration, because comparison queries can provide at most 1 bit of information. Prior works have attempted to tackle this problem by actively generating the comparison questions (Basu et al, 2019; Biyik and Sadigh, 2018; Katz et al, 2019; Sadigh et al, 2017; Wilde et al, 2019). Although they were able to achieve significant gains in terms of the required number of comparisons, we hypothesize that one can attain even better data efficiency by leveraging multiple sources of information, even when some sources might not completely align with the true reward functions, e.g., demonstrations as in the driving work by Basu et al (2017).…”

Section: Learning Reward Functions From Rankings (mentioning)

confidence: 90%
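The "at most 1 bit" claim in the statement above follows from the entropy of a two-outcome answer: the information a comparison can carry is bounded by the entropy of the response, which peaks at 1 bit when both options are equally likely. A minimal sketch of that bound follows; the function name `binary_entropy` and the probabilities used are illustrative and do not come from the cited papers.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a yes/no answer where P(first option) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain answer carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Illustrative probabilities that the human prefers the first trajectory.
probabilities = [0.5, 0.7, 0.9, 0.99]
entropies = [binary_entropy(p) for p in probabilities]

for p, h in zip(probabilities, entropies):
    print(f"P(first) = {p:.2f} -> at most {h:.3f} bits")
```

The further the expected answer is from a coin flip, the less a single comparison can reveal, which is one motivation for choosing queries actively.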

“…We have previously developed several tools to improve the computational efficiency of volume removal, or to extend volume removal to better accommodate human users. These tools include batch optimization (Biyik and Sadigh, 2018;Bıyık et al, 2019), iterated correction (Palan et al, 2019), and dynamically changing reward functions (Basu et al, 2019). Importantly, the listed tools are agnostic to the details of volume removal: they simply require (a) the query generation algorithm to operate in a greedy manner while (b) maintaining a belief over v. Our proposed information gain approach for generating easy queries satisfies both of these requirements.…”

Section: Useful Tools and Extensions (mentioning)

confidence: 99%
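The two requirements named in the statement above, greedy query generation and a maintained belief over the reward parameters, can be sketched as follows. This is an illustrative sketch only, not the implementation from any of the cited papers: it assumes a linear reward over trajectory features, a logistic (noisily rational) response model, and a belief represented by samples. All names (`prob_pick_a`, `information_gain`, `greedy_query`, `update_belief`) are hypothetical.

```python
import math
import random


def prob_pick_a(w, phi_a, phi_b):
    """P(human prefers A over B | reward weights w), assumed logistic model."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, phi_a, phi_b))
    return 1.0 / (1.0 + math.exp(-diff))


def entropy_bits(p):
    """Entropy, in bits, of a two-outcome answer with P(A) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def information_gain(w_samples, phi_a, phi_b):
    """Sample estimate of the mutual information between the answer and w."""
    probs = [prob_pick_a(w, phi_a, phi_b) for w in w_samples]
    mean_p = sum(probs) / len(probs)
    expected_entropy = sum(entropy_bits(p) for p in probs) / len(probs)
    return entropy_bits(mean_p) - expected_entropy


def greedy_query(w_samples, candidates):
    """(a) Greedy step: pick the candidate pair with the largest estimated gain."""
    gains = [information_gain(w_samples, a, b) for a, b in candidates]
    best = max(range(len(gains)), key=gains.__getitem__)
    return best, gains


def update_belief(w_samples, phi_a, phi_b, picked_a, rng):
    """(b) Belief step: resample w in proportion to the answer's likelihood."""
    probs = [prob_pick_a(w, phi_a, phi_b) for w in w_samples]
    weights = [p if picked_a else 1.0 - p for p in probs]
    return rng.choices(w_samples, weights=weights, k=len(w_samples))
```

A loop would alternate `greedy_query` and `update_belief`, one query at a time. Extensions such as batching or a time-varying reward would replace one of the two steps while leaving the overall structure unchanged.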


Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences

Bıyık

Losey

Palan

et al. 2021

The International Journal of Robotics Research

Self Cite


Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data, yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.


“…We consider these agents as human-like agents assuming that humans have less task handling capabilities but their creativity and higher risk tolerance leads to noisy rational decisions. To model noise in agent decisions, we incorporate a noisy rational model, a widely used human decision model in cognitive science [19]-[21], into our proposed framework. In particular, we give an agent the option to take any action with certain probability defined as…”

Section: A Effect Of Heterogeneity On Team Performance 1) Heterogeneity In Capabilities (mentioning)

confidence: 99%

Impact of Heterogeneity and Risk Aversion on Task Allocation in Multi-Agent Teams

Ghadami

Bayrak

et al. 2021

IEEE Robot. Autom. Lett.


Cooperative multi-agent decision-making is a ubiquitous problem with many real-world applications. In many practical applications, it is desirable to design a multi-agent team with a heterogeneous composition where the agents can have different capabilities and levels of risk tolerance to address diverse requirements. While heterogeneity in multi-agent teams offers benefits, new challenges arise including how to find optimal heterogeneous team compositions and how to dynamically distribute tasks among agents in complex operations. In this work, we develop an artificial intelligence framework for multi-agent heterogeneous teams to dynamically learn task distributions among agents through reinforcement learning. The framework extends Decentralized Partially Observable Markov Decision Processes (Dec-POMDP) so that they can model various types of heterogeneity. We demonstrate our approach with a benchmark problem on a disaster relief scenario. The effect of heterogeneity and risk aversion in agent capabilities and decision-making strategies on the performance of multi-agent teams in uncertain environments is analyzed. Results show that a well-designed heterogeneous team outperforms its homogeneous counterpart and possesses higher adaptivity in uncertain environments.


“…Other approaches, such as those that use probabilistic methods to learn a task [6]-[10], also rely on highly skilled demonstrators, accounting for imperfections with relatively small-scale noise in the probabilistic representation. To enable task learning that more closely captures human preferences and accounts for imperfect or incomplete demonstrations, active learning methods have been developed [11]-[15]. In these approaches, the human is treated as an oracle that the autonomy can query, improving learning quality.…”

Section: Related Work (mentioning)

confidence: 99%

Ergodic imitation: Learning from what to do and what not to do

Kalinowska

Prabhakar

Fitzsimons

et al. 2021

2021 IEEE International Conference on Robotics and Automation (ICRA)


With growing access to versatile robotics, it is beneficial for end users to be able to teach robots tasks without needing to code a control policy. One possibility is to teach the robot through successful task executions. However, near-optimal demonstrations of a task can be difficult to provide and even successful demonstrations can fail to capture task aspects key to robust skill replication. Here, we propose a learning from demonstration (LfD) approach that enables learning of robust task definitions without the need for near-optimal demonstrations. We present a novel algorithmic framework for learning tasks based on the ergodic metric, a measure of information content in motion. Moreover, we make use of negative demonstrations (demonstrations of what not to do) and show that they can help compensate for imperfect demonstrations, reduce the number of demonstrations needed, and highlight crucial task elements improving robot performance. In a proof-of-concept example of cart-pole inversion, we show that negative demonstrations alone can be sufficient to successfully learn and recreate a skill. Through a human subject study with 24 participants, we show that consistently more information about a task can be captured from combined positive and negative (posneg) demonstrations than from the same amount of just positive demonstrations. Finally, we demonstrate our learning approach on simulated tasks of target reaching and table cleaning with a 7-DoF Franka arm. Our results point towards a future with robust, data-efficient LfD for novice users.
