2013
DOI: 10.1109/tit.2012.2230215

Learning in a Changing World: Restless Multiarmed Bandit With Unknown Dynamics

Abstract: We consider the restless multiarmed bandit problem with unknown dynamics in which a player chooses one out of N arms to play at each time. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm…
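The setting in the abstract can be made concrete with a small simulation: each arm is a two-state Markov chain whose reward equals its current state, the played arm transitions by its "active" rule while the others follow an arbitrary "passive" rule, and weak regret compares a policy's cumulative reward against always playing the single best arm. This is a minimal sketch; the transition matrices, the round-robin baseline policy, and the horizon are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arms, each a two-state (reward 0/1) Markov chain.  The transition
# matrices below are illustrative assumptions, not values from the paper.
ACTIVE_P = [np.array([[0.9, 0.1], [0.2, 0.8]]),    # arm 0 when played
            np.array([[0.5, 0.5], [0.5, 0.5]])]    # arm 1 when played
PASSIVE_P = [np.array([[0.7, 0.3], [0.3, 0.7]]),   # arbitrary passive dynamics
             np.array([[0.6, 0.4], [0.4, 0.6]])]

def stationary_mean(P):
    """Mean reward under a chain's stationary distribution (reward = state, 0 or 1)."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi /= pi.sum()
    return pi[1]

def simulate(policy, T):
    """Total reward collected by `policy` (a map from time step to arm index)."""
    states = [0, 0]
    total = 0.0
    for t in range(T):
        arm = policy(t)
        total += states[arm]            # reward is the played arm's current state
        for i in range(2):              # played arm moves by its active rule,
            P = ACTIVE_P[i] if i == arm else PASSIVE_P[i]   # the other by its passive rule
            states[i] = rng.choice(2, p=P[states[i]])
    return total

T = 10_000
best_single_arm = T * max(stationary_mean(P) for P in ACTIVE_P)
round_robin = simulate(lambda t: t % 2, T)           # naive baseline policy
print(f"weak regret of round-robin: {best_single_arm - round_robin:.1f}")
```

The policy here is deliberately naive; the point is only to show how the weak-regret baseline (always playing the best arm in stationary terms) is compared against an arbitrary arm-selection rule.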

Cited by 134 publications (156 citation statements)
References 42 publications (87 reference statements)
“…In a robotics application, the need for adaptive interaction that takes habituation into account has recently been formulated for empathic behavior [12] (in this paper, we take a more general approach). Going back to the problem of preference dynamics, our problem can formally be compared to the restless multiarmed bandit problem, where rewards are non-stationary and which is generally known to be PSPACE-hard [5]. In this work, we restrict the rewards to evolve according to one of three models, which makes the problem of learning the model parameters easier to solve.…”
Section: Related Work (mentioning)
confidence: 99%
“…The problem can hence be compared to the Multi-Armed Bandit problem, where a single player, choosing at each time step one out of several possible arms to play and receiving a reward for it, aims to maximize the total reward (or, equivalently, to minimize the total regret) [5]. In our case, the rewards are stochastic and non-stationary, and the arms or actions, corresponding to the different interaction options, are relatively few.…”
Section: Problem Setting (mentioning)
confidence: 99%
“…The first term is the expected total reward of the ideal policy by time t, because constantly playing the arms that give the largest average reward θ_i can be considered optimal. As in [4] and [5], to measure the performance of RMAB policies, we use…”
Section: A New Definition Of Regret (mentioning)
confidence: 99%
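Written out, the regret measure this excerpt refers to is the weak regret used in [4] and [5]: the gap between always playing the arm with the largest stationary mean reward and the policy's expected cumulative reward. The notation below (σ ordering the arms so that θ_{σ(1)} is the largest mean, r_π(s) the reward collected at time s) is an assumption for illustration.

```latex
% Weak regret of policy \pi after t plays; \theta_{\sigma(1)} is the largest
% stationary mean reward and r_\pi(s) the reward collected at time s.
R_\pi(t) \;=\; t\,\theta_{\sigma(1)} \;-\; \mathbb{E}_\pi\!\left[\sum_{s=1}^{t} r_\pi(s)\right]
```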
“…Thus, settings where all channels (arms) are identical for all users with i.i.d. rewards have been considered, and index-type policies that can achieve coordination have been proposed that attain O(log T) regret uniformly over time [14], [15], [16], [10]. A similar result for the Markovian reward model with weak regret has been shown by [10], assuming some non-trivial bounds on the underlying Markov chains are known a priori.…”
Section: Introduction (mentioning)
confidence: 97%
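The index-type policies cited in this excerpt share a common shape: each arm is scored by its empirical mean plus an exploration bonus, and the arm with the largest index is played. The sketch below shows a generic UCB1-style index for i.i.d. rewards in [0, 1]; the bonus constant and the coordination mechanisms of [14]-[16], [10] (and the larger, chain-dependent constants needed for Markovian rewards) are not reproduced.

```python
import math

def ucb1_index(sample_mean, plays, t):
    """Generic UCB1-style index: empirical mean plus an exploration bonus
    that shrinks as an arm accumulates plays.  The constant 2.0 matches the
    classic UCB1 analysis for i.i.d. rewards in [0, 1]."""
    return sample_mean + math.sqrt(2.0 * math.log(t) / plays)
```

At each time t the player computes this index for every arm (from its own observations) and plays an argmax, which is what yields the logarithmic regret growth described above.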
“…[9] proposes another, simpler policy which achieves the same bounds for weak regret. [10] proposes a policy based on a deterministic sequence of exploration and exploitation and achieves the same bounds for weak regret. In [11], the authors consider the notion of strong regret and propose a policy which achieves near-log T (strong) regret for some special cases of the restless model.…”
Section: Introduction (mentioning)
confidence: 99%
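The deterministic sequencing of exploration and exploitation mentioned for [10] can be sketched as follows: exploration slots are scheduled deterministically so that their number grows logarithmically in time, arms are sampled round-robin in those slots, and every other slot exploits the arm with the best empirical mean built from exploration data only. This is a minimal sketch of the scheduling idea; the constant D, the Bernoulli test arms, and the plain sample-mean index are assumptions for illustration and omit the regenerative-cycle details of the actual policy.

```python
import math
import random

def dsee_sketch(arms, T, D=10.0):
    """Sketch of a deterministic exploration/exploitation sequence.

    `arms` is a list of zero-argument callables returning reward samples.
    An exploration slot is scheduled whenever fewer than D*log(t) of them
    have occurred so far; D is an assumed tuning constant.
    """
    n = len(arms)
    counts, sums = [0] * n, [0.0] * n
    explored, total = 0, 0.0
    for t in range(1, T + 1):
        explore = explored < n or explored < D * math.log(t + 1)
        if explore:
            arm = explored % n                     # round-robin over arms
            explored += 1
        else:                                      # exploit the best empirical mean
            arm = max(range(n), key=lambda i: sums[i] / counts[i])
        reward = arms[arm]()
        total += reward
        if explore:                                # statistics use exploration slots only
            counts[arm] += 1
            sums[arm] += reward
    return total

# Usage: two i.i.d. Bernoulli arms with assumed means 0.4 and 0.6.
random.seed(0)
arms = [lambda: float(random.random() < 0.4),
        lambda: float(random.random() < 0.6)]
print(dsee_sketch(arms, T=10_000))
```

Because the exploration budget grows only logarithmically, the reward lost to exploration is O(log T), which is the mechanism behind the weak-regret bounds discussed in this excerpt.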