1994
DOI: 10.1007/bf02191765
Multi-Armed bandit problem revisited

Cited by 33 publications (15 citation statements)
References 7 publications
“…There are methods, like the Gittins allocation indices, that allow one to find the optimal machine to play at each time n by considering each reward process independently of the others (even though the globally optimal solution depends on all the processes). However, computing the Gittins indices for the average (undiscounted) reward criterion used here requires prior knowledge of the reward processes (see, e.g., Ishikida & Varaiya, 1994). To overcome this requirement, one can learn the Gittins indices, as proposed in Duff (1995) for the case of finite-state Markovian reward processes.…”
Section: Discussion (mentioning)
confidence: 99%
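The excerpt above describes the defining structure of index policies: each arm is scored using only its own reward history, and the arm with the largest index is played. A minimal sketch of that structure is below; the UCB-style index used here is an illustrative stand-in (the actual Gittins index is considerably harder to compute), and all names are hypothetical.

```python
import math

def index_policy(histories, t):
    """Pick the arm with the largest per-arm index.

    Each arm's index depends only on that arm's own reward history,
    mirroring the structure of Gittins-style index policies. The
    mean-plus-bonus index below is an illustrative stand-in, not the
    Gittins index itself.
    """
    best_arm, best_index = None, -math.inf
    for arm, rewards in histories.items():
        if not rewards:                      # unplayed arms get priority
            return arm
        mean = sum(rewards) / len(rewards)
        bonus = math.sqrt(2 * math.log(t) / len(rewards))
        if mean + bonus > best_index:
            best_arm, best_index = arm, mean + bonus
    return best_arm
```

The key point the excerpt makes survives in the sketch: the per-arm score is computed without looking at any other arm, even though the globally optimal schedule depends on all of them.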
“…As a remark, note that a deterministic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [13]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received from each arm, and their problem is thus one of optimization, rather than exploration and exploitation.…”
Section: Introduction (mentioning)
confidence: 94%
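The contrast the excerpt draws can be made concrete: when every payoff is known in advance, the deterministic bandit reduces to pure optimization over pull sequences, with no exploration needed. A brute-force sketch under that assumption follows; the function name, discount factor, and data layout are illustrative.

```python
from itertools import product

def best_discounted_schedule(payoffs, horizon, beta=0.9):
    """Brute-force the pull sequence maximizing total discounted payoff.

    `payoffs[a]` is the payoff sequence of arm `a`, assumed known ahead
    of time as in the deterministic bandit setting described above, so
    the problem is pure optimization: no exploration/exploitation
    trade-off remains.
    """
    arms = list(payoffs)
    best_value, best_seq = float("-inf"), None
    for seq in product(arms, repeat=horizon):
        pos = {a: 0 for a in arms}           # next payoff index per arm
        value = 0.0
        for t, arm in enumerate(seq):
            value += beta ** t * payoffs[arm][pos[arm]]
            pos[arm] += 1
        if value > best_value:
            best_value, best_seq = value, seq
    return best_seq, best_value
```

For example, with `payoffs = {0: [1, 0], 1: [0.5, 0.5]}` and a horizon of 2, the optimal schedule pulls arm 0 first and then arm 1, collecting 1 + 0.9 × 0.5 = 1.45.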
“…Similar ideas were also used by Mandelbaum [12] and by Varaiya et al [17,9]. We now consider N bandit processes, with initial state Z(0) = i.…”
Section: Second Proof: Interleaving Of Prevailing Charges (mentioning)
confidence: 92%
“…We can now construct a sequence of stopping times with (9), which will continue indefinitely, or will reach IP(σ_{n_0} = τ(i)) = 1, in which case we define σ_n = τ(i), n > n_0.…”
Section: Theorem 2: The Supremum Of… (mentioning)
confidence: 99%