2013
DOI: 10.1109/jstsp.2013.2263494

Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems

Abstract: In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length T. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that for all light-tailed reward distributions, DSEE achieves the optimal logarithmic order of the regret, where regret is defined …
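To make the structure described in the abstract concrete, the following is a minimal Python sketch of a DSEE-style policy. It assumes a round-robin exploration order and a c·log(t) exploration budget per arm; the function name, parameters, and exact schedule are illustrative assumptions, not the paper's exact construction.

# Minimal sketch of a DSEE-style policy (illustrative assumptions only).
# Time is split into a deterministic exploration sequence, in which arms
# are played round-robin, and an exploitation sequence, in which the arm
# with the largest sample mean is played.

import math
import random


def dsee(arms, horizon, c=1.0):
    """Run a DSEE-style policy for `horizon` rounds.

    arms    : list of callables, each returning a random reward when played
    horizon : number of rounds T
    c       : exploration constant (assumed tuning parameter)
    """
    k = len(arms)
    counts = [0] * k          # number of times each arm has been played
    means = [0.0] * k         # running sample means
    explore_slots = 0         # size of the exploration sequence so far
    total_reward = 0.0

    for t in range(1, horizon + 1):
        # Deterministic rule: keep the exploration sequence at roughly
        # c * log(t) plays per arm (logarithmic density, as used for
        # light-tailed rewards).
        if explore_slots < k * math.ceil(c * math.log(t + 1)):
            arm = explore_slots % k                       # round-robin exploration
            explore_slots += 1
        else:
            arm = max(range(k), key=lambda i: means[i])   # exploit best sample mean

        reward = arms[arm]()
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total_reward += reward

    return total_reward, means


# Example with two Bernoulli arms.
if __name__ == "__main__":
    bandit = [lambda: float(random.random() < 0.4),
              lambda: float(random.random() < 0.6)]
    print(dsee(bandit, horizon=10_000))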

Cited by 80 publications (77 citation statements, 2013–2023) | References 27 publications

“…However, the difference between the pseudo-regret defined in [11] and regret in its original definition is in the order of O(√T). They have also shown that a variation of the DSEE policy developed in [15] for risk-neutral MAB achieves O(T^{2/3}) regret performance without the positive difference assumption. While [11] only focuses on policy development, this paper also provides tight lower bounds on both the asymptotic and finite-time regret performance, which serve as fundamental limits for gauging the optimality of learning policies.…”
Section: Related Work (mentioning)
confidence: 99%
“…A variation of the DSEE policy developed in [15] for risk-neutral MAB was considered in [11] and was shown to achieve O(T^{2/3}) finite-time regret performance. In the MV-DSEE policy, time is divided into two interleaving sequences: an exploration sequence denoted by E(t) and an exploitation sequence.…”
Section: B. Risk-averse Learning Policies (mentioning)
confidence: 99%
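For intuition on the interleaving structure quoted above, here is a small illustrative Python sketch of how a deterministic exploration sequence E(t) could be generated from a target cardinality rule. The specific growth rules below (logarithmic and T^{2/3}-type) are assumptions chosen to mirror the regret orders discussed, not the exact constructions in [11] or [15].

# Illustrative construction of a deterministic exploration sequence E(t):
# time t is an exploration slot whenever the sequence built so far is
# smaller than a prescribed target cardinality growth(t).

import math


def build_exploration_sequence(horizon, growth):
    """Return the exploration times up to `horizon`.

    growth : function mapping t to the required cardinality of E up to time t,
             e.g. lambda t: math.ceil(2 * math.log(t + 1))   # logarithmic
             or   lambda t: math.ceil(t ** (2 / 3))          # polynomial
    """
    explore_times = []
    for t in range(1, horizon + 1):
        if len(explore_times) < growth(t):
            explore_times.append(t)
    return explore_times


# Example: logarithmic vs. polynomial exploration density over 1000 rounds.
log_seq = build_exploration_sequence(1000, lambda t: math.ceil(2 * math.log(t + 1)))
poly_seq = build_exploration_sequence(1000, lambda t: math.ceil(t ** (2 / 3)))
print(len(log_seq), len(poly_seq))   # sparse vs. much denser exploration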
“…In [18], the authors presented a multiuser spectrum access model. When multiple users are present in the network, collisions are induced.…”
Section: Introduction (mentioning)
confidence: 99%
“…When multiple users are present in the network, collisions are induced. To address this, an adaptive random access model was presented in [18], a fair access model in [19], and a priority access model in [20] to reduce collisions among cognitive users. It is seen from existing research [21], [22] that most of these schemes are limited to providing only one channel at a time to a cognitive user.…”
Section: Introduction (mentioning)
confidence: 99%
“…For learning the shortest path, we can simply treat each path as an arm and directly apply existing MAB policies developed in [2][3][4][5][6]. This approach, however, results in a regret growing linearly with the number of paths, thus exponentially with the number of edges in the worst case.…”
Section: Introduction (mentioning)
confidence: 99%
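As an aside on the path-as-arm reduction quoted above, the following toy Python sketch (the graph and the helper function are hypothetical, for illustration only) enumerates all simple paths of a small graph, showing how the number of "arms" grows quickly with the number of edges.

# Toy illustration of the naive reduction: treat every source-destination
# path as one arm. Even small graphs yield many path-arms.

def all_simple_paths(adj, src, dst, path=None):
    """Enumerate all simple (cycle-free) paths from src to dst
    in an adjacency dictionary."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    paths = []
    for nxt in adj.get(src, []):
        if nxt not in path:               # keep paths simple (no cycles)
            paths.extend(all_simple_paths(adj, nxt, dst, path))
    return paths


# Small example graph: each resulting path would be handled as one arm.
adj = {"s": ["a", "b"], "a": ["b", "t"], "b": ["a", "t"]}
arms = all_simple_paths(adj, "s", "t")
print(len(arms), arms)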