2009
DOI: 10.1109/tac.2009.2031725

A Structured Multiarmed Bandit Problem and the Greedy Policy

Abstract: We consider a multiarmed bandit problem where the expected reward of each arm is a linear function of an unknown scalar with a prior distribution. The objective is to choose a sequence of arms that maximizes the expected total (or discounted total) reward. We demonstrate the effectiveness of a greedy policy that takes advantage of the known statistical correlation structure among the arms. In the infinite horizon discounted reward setting, we show that the greedy and optimal policies eventually coincide, and b…
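The linear structure described in the abstract is easy to illustrate. Below is a minimal sketch of the greedy policy, assuming the mean reward of arm i takes the form a[i] + b[i] * z with known coefficients, a Gaussian prior on the unknown scalar z, and Gaussian reward noise; the variable names, coefficient values, and prior parameters are illustrative assumptions, not the paper's notation.

import numpy as np

# Greedy policy sketch for a structured bandit where the mean reward of
# arm i is a[i] + b[i] * z, with the scalar z unknown and shared by all arms.
# Assumed model (illustrative): prior z ~ N(mu, tau2), noise ~ N(0, sigma2).
rng = np.random.default_rng(0)

a = np.array([0.0, 0.5, -0.2])  # known per-arm intercepts (assumed values)
b = np.array([1.0, -1.0, 2.0])  # known per-arm slopes (assumed values)
z_true = 0.7                    # unknown scalar the policy must learn
sigma2 = 1.0                    # reward noise variance
mu, tau2 = 0.0, 1.0             # Gaussian prior on z

for t in range(50):
    # Greedy step: pull the arm whose posterior-mean reward is largest.
    i = int(np.argmax(a + b * mu))
    reward = a[i] + b[i] * z_true + rng.normal(0.0, np.sqrt(sigma2))

    # Conjugate Gaussian update for the observation reward - a[i] = b[i]*z + noise.
    prec = 1.0 / tau2 + b[i] ** 2 / sigma2
    mu = (mu / tau2 + b[i] * (reward - a[i]) / sigma2) / prec
    tau2 = 1.0 / prec

print(f"posterior mean of z after 50 pulls: {mu:.3f} (true value {z_true})")

Because every pull updates the shared posterior on z, a single observation changes the estimated reward of all arms at once, which is the correlation structure the greedy policy exploits.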

Cited by 71 publications (65 citation statements)
References 31 publications (39 reference statements)
“…Finally, we suggest also to extend our policy learning scheme to other, and more complex, exploration-exploitation problems than the one tackled in this paper, such as for example bandit problems where the arms are not statistically independent (Mersereau et al, 2009) or general Markov Decision processes (Ishii et al, 2002).…”
Section: Results
Mentioning, confidence: 99%
“…Agrawal (1995) studies a multi-armed bandit in which arms represent points in the real line and their expected rewards are continuous functions of the arms. Mersereau et al (2009) and Rusmevichientong and Tsitsiklis (2010) study bandits with possibly infinite arms when expected rewards are linear functions of an (unknown) scalar and a vector, respectively. Our paper also relates to the literature on linear bandit models (see e.g., Abernethy et al (2008) and Dani et al (2008)) as the model we study is a linear stochastic bandit with a finite (but combinatorial) number of arms.…”
Section: Literature Review
Mentioning, confidence: 99%
“…Recent work has proposed an alternative model in which the arms have statistically dependent reward distributions, and therefore pulling one arm also gives information about other arms. In this setting, the correlation between payoffs of different arms allows for faster learning, even when the number of arms is very large (Dani et al 2008, Mersereau et al 2009). …”
Section: Related Work
Mentioning, confidence: 99%
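The faster learning mentioned in the quote above can be made concrete in the scalar linear model: one pull tightens the posterior on the shared parameter, and with it the uncertainty about every arm's mean reward. A minimal numerical check (all parameter values assumed for illustration):

import numpy as np

# One pull of arm 0 shrinks the posterior variance of EVERY arm's mean
# reward a[i] + b[i]*z, because all arms share the scalar z.
b = np.array([1.0, -1.0, 2.0])   # per-arm slopes (assumed values)
sigma2, tau2 = 1.0, 1.0          # noise variance, prior variance of z

tau2_post = 1.0 / (1.0 / tau2 + b[0] ** 2 / sigma2)  # posterior var of z
print(b ** 2 * tau2)       # prior variance of each arm's mean: [1. 1. 4.]
print(b ** 2 * tau2_post)  # after one pull of arm 0: [0.5 0.5 2.] -- all arms shrink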