2013
DOI: 10.1109/tit.2012.2230215

Learning in a Changing World: Restless Multiarmed Bandit With Unknown Dynamics

Abstract: We consider the restless multiarmed bandit problem with unknown dynamics in which a player chooses one out of N arms to play at each time. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays the best arm…
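The setting in the abstract can be made concrete with a small simulation: each arm is a two-state Markov chain whose reward equals its current state, the played arm transitions by its "active" rule while the others follow an arbitrary "passive" rule, and weak regret compares a policy's cumulative reward against always playing the single best arm. This is a minimal sketch; the transition matrices, the round-robin baseline policy, and the horizon are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arms, each a two-state (reward 0/1) Markov chain.  The transition
# matrices below are illustrative assumptions, not values from the paper.
ACTIVE_P = [np.array([[0.9, 0.1], [0.2, 0.8]]),    # arm 0 when played
            np.array([[0.5, 0.5], [0.5, 0.5]])]    # arm 1 when played
PASSIVE_P = [np.array([[0.7, 0.3], [0.3, 0.7]]),   # arbitrary passive dynamics
             np.array([[0.6, 0.4], [0.4, 0.6]])]

def stationary_mean(P):
    """Mean reward under a chain's stationary distribution (reward = state, 0 or 1)."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi /= pi.sum()
    return pi[1]

def simulate(policy, T):
    """Total reward collected by `policy` (a map from time step to arm index)."""
    states = [0, 0]
    total = 0.0
    for t in range(T):
        arm = policy(t)
        total += states[arm]            # reward is the played arm's current state
        for i in range(2):              # played arm moves by its active rule,
            P = ACTIVE_P[i] if i == arm else PASSIVE_P[i]   # the other by its passive rule
            states[i] = rng.choice(2, p=P[states[i]])
    return total

T = 10_000
best_single_arm = T * max(stationary_mean(P) for P in ACTIVE_P)
round_robin = simulate(lambda t: t % 2, T)           # naive baseline policy
print(f"weak regret of round-robin: {best_single_arm - round_robin:.1f}")
```

The policy here is deliberately naive; the point is only to show how the weak-regret baseline (always playing the best arm in stationary terms) is compared against an arbitrary arm-selection rule.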

Cited by 134 publications (156 citation statements)
References 42 publications (87 reference statements)
“…In a robotics application, the need for adaptive interaction that takes habituation into account has recently been formulated for empathic behavior [12] (in this paper, we take a more general approach). Going back to the problem of preference dynamics, our problem can formally be compared to the restless multiarmed bandit problem, where rewards are non-stationary and which is generally known to be PSPACE-hard [5]. In this work, we restrict the rewards to evolve according to one of three models, which makes the problem of learning the model parameters easier to solve.…”
Section: Related Work (mentioning)
confidence: 99%
“…The problem can hence be compared to the Multi-Armed Bandit problem, where a single player, choosing at each time step one out of several possible arms to play and receiving a reward for it, aims to maximize the total reward (or, equivalently, to minimize the total regret) [5]. In our case, the rewards are stochastic and non-stationary, and the arms or actions, corresponding to the different interaction options, are relatively few.…”
Section: Problem Setting (mentioning)
confidence: 99%
“…The first term is the expected total reward of the ideal policy by time t, because constantly playing the arms that give the largest average reward θ_i can be considered optimal. As in [4] and [5], to measure the performance of RMAB policies, we use…”
Section: A New Definition Of Regret (mentioning)
confidence: 99%
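Written out, the regret measure this excerpt refers to is the weak regret used in [4] and [5]: the gap between always playing the arm with the largest stationary mean reward and the policy's expected cumulative reward. The notation below (σ ordering the arms so that θ_{σ(1)} is the largest mean, r_π(s) the reward collected at time s) is an assumption for illustration.

```latex
% Weak regret of policy \pi after t plays; \theta_{\sigma(1)} is the largest
% stationary mean reward and r_\pi(s) the reward collected at time s.
R_\pi(t) \;=\; t\,\theta_{\sigma(1)} \;-\; \mathbb{E}_\pi\!\left[\sum_{s=1}^{t} r_\pi(s)\right]
```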
“…Thus, settings where all channels (arms) are identical for all users with i.i.d. rewards have been considered, and index-type policies that can achieve coordination have been proposed that attain O(log T) regret uniformly over time [14], [15], [16], [10]. A similar result for the Markovian reward model with weak regret has been shown by [10], assuming some non-trivial bounds on the underlying Markov chains are known a priori.…”
Section: Introduction (mentioning)
confidence: 97%
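The index-type policies cited in this excerpt share a common shape: each arm is scored by its empirical mean plus an exploration bonus, and the arm with the largest index is played. The sketch below shows a generic UCB1-style index for i.i.d. rewards in [0, 1]; the bonus constant and the coordination mechanisms of [14]-[16], [10] (and the larger, chain-dependent constants needed for Markovian rewards) are not reproduced.

```python
import math

def ucb1_index(sample_mean, plays, t):
    """Generic UCB1-style index: empirical mean plus an exploration bonus
    that shrinks as an arm accumulates plays.  The constant 2.0 matches the
    classic UCB1 analysis for i.i.d. rewards in [0, 1]."""
    return sample_mean + math.sqrt(2.0 * math.log(t) / plays)
```

At each time t the player computes this index for every arm (from its own observations) and plays an argmax, which is what yields the logarithmic regret growth described above.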
“…[9] proposes another, simpler policy which achieves the same bounds for weak regret. [10] proposes a policy based on a deterministic sequence of exploration and exploitation and achieves the same bounds for weak regret. In [11], the authors consider the notion of strong regret and propose a policy which achieves near-log T (strong) regret for some special cases of the restless model.…”
Section: Introduction (mentioning)
confidence: 99%
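The deterministic sequencing of exploration and exploitation mentioned for [10] can be sketched as follows: exploration slots are scheduled deterministically so that their number grows logarithmically in time, arms are sampled round-robin in those slots, and every other slot exploits the arm with the best empirical mean built from exploration data only. This is a minimal sketch of the scheduling idea; the constant D, the Bernoulli test arms, and the plain sample-mean index are assumptions for illustration and omit the regenerative-cycle details of the actual policy.

```python
import math
import random

def dsee_sketch(arms, T, D=10.0):
    """Sketch of a deterministic exploration/exploitation sequence.

    `arms` is a list of zero-argument callables returning reward samples.
    An exploration slot is scheduled whenever fewer than D*log(t) of them
    have occurred so far; D is an assumed tuning constant.
    """
    n = len(arms)
    counts, sums = [0] * n, [0.0] * n
    explored, total = 0, 0.0
    for t in range(1, T + 1):
        explore = explored < n or explored < D * math.log(t + 1)
        if explore:
            arm = explored % n                     # round-robin over arms
            explored += 1
        else:                                      # exploit the best empirical mean
            arm = max(range(n), key=lambda i: sums[i] / counts[i])
        reward = arms[arm]()
        total += reward
        if explore:                                # statistics use exploration slots only
            counts[arm] += 1
            sums[arm] += reward
    return total

# Usage: two i.i.d. Bernoulli arms with assumed means 0.4 and 0.6.
random.seed(0)
arms = [lambda: float(random.random() < 0.4),
        lambda: float(random.random() < 0.6)]
print(dsee_sketch(arms, T=10_000))
```

Because the exploration budget grows only logarithmically, the reward lost to exploration is O(log T), which is the mechanism behind the weak-regret bounds discussed in this excerpt.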