We present a formal model of human decision-making in explore-exploit tasks using the context of multi-armed bandit problems, where the decision-maker must choose among multiple options with uncertain rewards. We address the standard multi-armed bandit problem, the multi-armed bandit problem with transition costs, and the multi-armed bandit problem on graphs. We focus on the case of Gaussian rewards in a setting where the decision-maker uses Bayesian inference to estimate the reward values. We model the decision-maker's prior knowledge with the Bayesian prior on the mean reward. We develop the upper credible limit (UCL) algorithm for the standard multi-armed bandit problem and show that this deterministic algorithm achieves logarithmic cumulative expected regret, which is optimal performance for uninformative priors. We show how good priors and good assumptions on the correlation structure among arms can greatly enhance decision-making performance, even over short time horizons. We extend the deterministic algorithm to the stochastic UCL algorithm and draw several connections to human decision-making behavior. We present empirical data from human experiments and show that human performance is efficiently captured by the stochastic UCL algorithm with appropriate parameters. For the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs, we generalize the UCL algorithm to the block UCL algorithm and the graphical block UCL algorithm, respectively. We show that these algorithms also achieve logarithmic cumulative expected regret and require a sub-logarithmic expected number of transitions among arms. We further illustrate the performance of these algorithms with numerical examples.

NB: Appendix G, included in this version, details minor modifications that correct for an oversight in the previously published proofs. The remainder of the text reflects the published work.
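The core of the UCL selection rule described above can be sketched as follows, assuming independent Gaussian priors per arm and known sampling noise. The function and parameter names, and the simple 1 - alpha/t credibility schedule, are illustrative assumptions rather than the paper's exact specification:

```python
import numpy as np
from statistics import NormalDist

def ucl_select(mu0, sigma0, sigma_s, counts, means, t, alpha=0.05):
    """Choose the arm with the largest upper credible limit (UCL).

    Gaussian rewards with known noise std sigma_s and an independent
    N(mu0, sigma0**2) prior on each arm's mean (mu0 may be a scalar or
    an array). counts/means hold the number and empirical mean of pulls.
    """
    prior_prec = 1.0 / sigma0 ** 2                       # prior precision
    like_prec = np.asarray(counts, float) / sigma_s ** 2 # data precision
    post_var = 1.0 / (prior_prec + like_prec)            # posterior variance
    post_mean = post_var * (prior_prec * mu0
                            + like_prec * np.asarray(means, float))
    z = NormalDist().inv_cdf(1.0 - alpha / t)            # credibility quantile
    return int(np.argmax(post_mean + z * np.sqrt(post_var)))
```

A well-chosen informative prior (e.g., a large `mu0` entry on a believed-good arm) steers the very first choices, which is how good priors can improve short-horizon performance.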
Abstract: Satisficing is a relaxation of maximizing and allows for less risky decision making in the face of uncertainty. We propose two sets of satisficing objectives for the multi-armed bandit problem, where the objective is to achieve reward-based decision-making performance above a given threshold. We show that these new problems are equivalent to various standard multi-armed bandit problems with maximizing objectives and use the equivalence to find bounds on performance. The different objectives can result in qualitatively different behavior; for example, agents explore their options continually in one case and only a finite number of times in another. For the case of Gaussian rewards we show an additional equivalence between the two sets of satisficing objectives that allows algorithms developed for one set to be applied to the other. We then develop variants of the Upper Credible Limit (UCL) algorithm that solve the problems with satisficing objectives and show that these modified UCL algorithms achieve efficient satisficing performance.
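One way to picture a satisficing selection rule is an agent that keeps its current arm whenever that arm's upper credible limit clears the reward threshold, and falls back to maximizing otherwise. This is an illustrative sketch, not the paper's exact satisficing UCL algorithm; all names are assumptions:

```python
import numpy as np

def satisficing_select(ucls, current, threshold):
    """Stay with the current arm if its upper credible limit clears the
    satisficing threshold; otherwise fall back to maximizing over UCLs."""
    ucls = np.asarray(ucls, dtype=float)
    if ucls[current] >= threshold:
        return current               # good enough: no need to switch
    return int(np.argmax(ucls))      # below threshold: maximize instead
```

With a low threshold the agent stops exploring once any arm plausibly suffices, which is one mechanism behind the finite-exploration behavior mentioned above.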
Abstract: With an eye towards human-centered automation, we contribute to the development of a systematic means to infer features of human decision-making from behavioral data. Motivated by the common use of softmax selection in models of human decision-making, we study the maximum likelihood parameter estimation problem for softmax decision-making models with linear objective functions. We present conditions under which the likelihood function is convex. These allow us to provide sufficient conditions for convergence of the resulting maximum likelihood estimator and to construct its asymptotic distribution. In the case of models with nonlinear objective functions, we show how the estimator can be applied by linearizing about a nominal parameter value. We apply the estimator to fit the stochastic UCL (Upper Credible Limit) model of human decision-making to human subject data. We show statistically significant differences in behavior across related, but distinct, tasks.

Note to Practitioners: We propose and demonstrate a rigorous method to estimate parameters of softmax decision-making models. These decision-making models hold great promise for use in developing model-based human-centered automation. We are motivated by the recently derived UCL (Upper Credible Limit) model, which predicts the choices that humans are likely to make when deciding among alternatives with uncertain rewards. Key parameters of the model represent the human's intuition about the task, and estimating these parameters from behavioral data would allow an automated system to learn about its human supervisor. Our parameter estimation method is fast enough to be implemented in real time for most scenarios, although our analysis of the method holds when the model has a particular linear structure. We show how to extend the method to a more general nonlinear model using linearization, and we show that the linearization approach works for the motivating UCL model.
The parameter estimation method with linearization can be used for other nonlinear models; however, the domain of its validity may vary.
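The likelihood being maximized can be sketched as follows, assuming a linear objective where option i at trial t has value `X[t][i] @ theta`. The names are illustrative; in practice one would minimize this convex function with a standard numerical optimizer:

```python
import numpy as np

def softmax_nll(theta, X, choices):
    """Negative log-likelihood of observed choices under softmax selection
    with a linear objective. X[t] is a (num_options, num_features) array;
    choices[t] is the index of the chosen option at trial t. For this
    linear model the function is convex in theta.
    """
    nll = 0.0
    for Xt, c in zip(X, choices):
        v = np.asarray(Xt) @ theta
        v = v - v.max()                        # shift for numerical stability
        nll -= v[c] - np.log(np.exp(v).sum())  # minus log softmax probability
    return nll
```

For a nonlinear model, the same machinery applies after linearizing the objective about a nominal parameter value, as the abstract describes.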
We develop a dynamical systems approach to prioritizing and selecting multiple recurring tasks with the aim of conferring a degree of deliberative goal selection to a mobile robot confronted with competing objectives. We take navigation as our prototypical task, and use reactive (i.e., vector field) planners derived from navigation functions to encode control policies that achieve each individual task. We associate a scalar "value" with each task representing its current urgency and let that quantity evolve in time as the robot evaluates the importance of its assigned task relative to competing tasks. The robot's motion control input is generated as a convex combination of the individual task vector fields. Their weights, in turn, evolve dynamically according to a decision model adapted from the literature on bioinspired swarm decision making, driven by the values. In this paper we study a simple case with two recurring, competing navigation tasks and derive conditions under which it can be guaranteed that the robot will repeatedly serve each in turn. Specifically, we provide conditions sufficient for the emergence of a stable limit cycle along which the robot repeatedly and alternately navigates to the two goal locations. Numerical study suggests that the basin of attraction is quite large, so that the robot recovers from significant perturbations with a reliable return to the desired task coordination pattern.
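The motion-input construction can be sketched generically: each task contributes a vector field, and the task weights are normalized into a convex combination. Everything here is an illustrative stand-in; the paper's weight dynamics come from a bioinspired swarm decision model that is not reproduced in this sketch:

```python
import numpy as np

def blended_input(x, fields, weights):
    """Motion input as a convex combination of per-task vector fields,
    evaluated at state x. weights are nonnegative task urgencies."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize so weights sum to 1
    return sum(wi * f(x) for wi, f in zip(w, fields))

# Example tasks: simple attracting fields, each pointing at its own goal.
def to_goal(g):
    return lambda x: g - x
```

When one task's weight dominates, the blended field approximates that task's planner; the alternation between goals emerges from the weight dynamics, not from this static blend.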
Abstract: We consider two variants of the standard multi-armed bandit problem, namely, the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs. We develop block allocation algorithms for these problems that achieve an expected cumulative regret that is uniformly dominated by a logarithmic function of time, and an expected cumulative number of transitions from one arm to another uniformly dominated by a double-logarithmic function of time. We observe that the multi-armed bandit problem with transition costs and the associated block allocation algorithm capture the key features of popular animal foraging models in the literature.
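The double-logarithmic transition count can be made concrete with a doubling-block schedule: if each successive block of pulls on an arm is twice as long as the last, an arm played n times is entered only O(log n) times, and since a suboptimal arm is played O(log t) times overall, its transitions grow as O(log log t). The schedule below is a hedged sketch of this idea; the papers' exact block lengths differ:

```python
def blocks_to_cover(n):
    """Number of doubling-length blocks (1, 2, 4, ...) needed to
    accumulate at least n plays of a single arm. Grows like log2(n),
    so transitions into the arm scale logarithmically in its play count."""
    k, total = 0, 0
    while total < n:
        total += 2 ** k   # the k-th block contributes 2**k plays
        k += 1
    return k
```

Grouping pulls into blocks is also what connects the algorithm to foraging: an animal tends to exploit a patch in sustained bouts rather than switching on every decision.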