2017
DOI: 10.48550/arxiv.1704.09028
Preprint

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

Abstract: The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientation is that it does not treat time preference in a coherent manner. Time preference plays an important role when the optimal action is costly to learn relative to near-optimal actions. This limitation has not only restricted the relevance of theoretical results but has also influenced the design of alg…

Cited by 8 publications (11 citation statements); references 11 publications.

“…Bootstrapping attains an empirical complexity of O(N^3), confirming the findings of Osband et al [4]. However, in many cases (N = 23, 27, 33, 37, 43, 47, 51, 55) there was one seed for which bootstrapping either did not learn within the step limit (green dashed line, empty dots) due to premature convergence, or learned after substantially more steps than the average (red dots, large error bar). Missing algorithms (random, ε-greedy, UCB1, approximate Thompson sampling) performed extremely poorly, often not learning within 500,000 steps even for small N, and are thus not reported.…”
Section: Empirical Sample Complexity on the Deep Sea (supporting)
confidence: 84%
“…Similarly to UCB-based methods, Thompson sampling is guaranteed to converge to an optimal policy in multiarmed bandit problems [38,39], and has shown strong empirical performance [40,41]. For a discussion of known shortcomings of Thompson sampling, we refer to [42][43][44].…”
Section: Related Work (mentioning)
confidence: 99%
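The statement above concerns Thompson sampling's convergence guarantees and empirical strength in multi-armed bandits. For context, here is a minimal, self-contained sketch of the standard Beta-Bernoulli version of the algorithm (a generic textbook formulation, not code from any of the cited papers): maintain a Beta posterior per arm, draw one sample of each arm's mean, and play the arm with the highest sample.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Standard Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas = np.ones(k)   # posterior Beta alpha parameters (successes + 1)
    betas = np.ones(k)    # posterior Beta beta parameters (failures + 1)
    cumulative_regret = 0.0
    best_mean = max(true_means)
    for _ in range(horizon):
        theta = rng.beta(alphas, betas)      # one posterior sample per arm
        arm = int(np.argmax(theta))          # play the sampled-best arm
        reward = rng.binomial(1, true_means[arm])
        alphas[arm] += reward
        betas[arm] += 1 - reward
        cumulative_regret += best_mean - true_means[arm]
    return cumulative_regret

# Example: three arms whose top two means are hard to distinguish.
print(thompson_sampling_bernoulli([0.10, 0.50, 0.52], horizon=10_000))
```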
“…This issue is discussed further in [49]. That paper proposes and analyzes satisficing Thompson sampling, a variant of Thompson sampling that is designed to minimize the exploration costs required to identify an action that is sufficiently close to optimal.…”
Section: Limitations of Thompson Sampling (mentioning)
confidence: 99%
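Satisficing Thompson sampling, as described in [49], targets an action that is sufficiently close to optimal rather than the exact optimum, in order to cut exploration costs. The sketch below is one plausible, heavily simplified reading of that idea, assuming Beta-Bernoulli arms and a tolerance epsilon: sample from the posterior as usual, but play the lowest-indexed arm whose sampled mean is within epsilon of the sampled maximum. The exact satisficing rule analyzed in [49] may differ; this is only an illustration.

```python
import numpy as np

def satisficing_thompson_sampling(true_means, horizon, epsilon=0.05, seed=0):
    """Illustrative satisficing variant (simplified reading of [49], not the exact rule).

    Among arms whose posterior sample is within `epsilon` of the sampled optimum,
    play the lowest-indexed one, so the agent settles for a "good enough" arm
    instead of paying to identify the exact optimum.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas, betas = np.ones(k), np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        theta = rng.beta(alphas, betas)
        near_optimal = np.flatnonzero(theta >= theta.max() - epsilon)
        arm = int(near_optimal[0])           # lowest-indexed satisficing arm
        reward = rng.binomial(1, true_means[arm])
        alphas[arm] += reward
        betas[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# With arms 0.50 and 0.52 and epsilon = 0.05, the agent settles quickly rather
# than spending many pulls distinguishing the top two arms.
print(satisficing_thompson_sampling([0.10, 0.50, 0.52], horizon=10_000))
```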
“…Thompson Sampling (or UCB approaches) would never select such actions, even if they are worth their cost (Russo & Van Roy, 2014). In addition, Thompson Sampling does not take into account the time horizon at which the process ends; if the horizon is known, exploration efforts should be tuned accordingly (Russo et al, 2017). Nonetheless, under the assumption that very accurate posterior approximations lead to efficient decisions, the question is: what happens when the approximations are not so accurate?…”
mentioning
confidence: 99%
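To make the horizon point concrete, here is a small, hedged illustration (a toy heuristic of my own, not a method from the cited works): if the terminal time is known, an agent can simply stop exploring near the end of the horizon, since information gained that late has too little remaining time to pay off.

```python
import numpy as np

def horizon_aware_thompson_sampling(true_means, horizon, greedy_tail=0.1, seed=0):
    """Toy heuristic (not from the cited works): Thompson sampling that turns greedy
    during the last `greedy_tail` fraction of a *known* horizon, since exploration
    that late can no longer pay for itself."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas, betas = np.ones(k), np.ones(k)
    total_reward = 0.0
    for t in range(horizon):
        if horizon - t <= greedy_tail * horizon:
            # Few rounds remain: exploit the current posterior means.
            arm = int(np.argmax(alphas / (alphas + betas)))
        else:
            # Plenty of rounds remain: explore via posterior sampling.
            arm = int(np.argmax(rng.beta(alphas, betas)))
        reward = rng.binomial(1, true_means[arm])
        alphas[arm] += reward
        betas[arm] += 1 - reward
        total_reward += reward
    return total_reward

print(horizon_aware_thompson_sampling([0.10, 0.50, 0.52], horizon=200))
```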