2017
DOI: 10.48550/arxiv.1704.09028
Preprint

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

Abstract: The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientation is that it does not treat time preference in a coherent manner. Time preference plays an important role when the optimal action is costly to learn relative to near-optimal actions. This limitation has not only restricted the relevance of theoretical results but has also influenced the design of alg…

Cited by 8 publications (11 citation statements); references 11 publications.

“…Bootstrapping attains an empirical complexity of O(N^3), confirming the findings of Osband et al [4]. However, in many cases (N = 23, 27, 33, 37, 43, 47, 51, 55) there was one seed for which bootstrapping either did not learn within the step limit (green dashed line, empty dots) due to premature convergence, or learned after substantially more steps than the average (red dots, large error bar). Missing algorithms (random, ε-greedy, UCB1, approximate Thompson sampling) performed extremely poorly, often not learning within 500,000 steps even for small N, and are thus not reported.…”
Section: Empirical Sample Complexity on the Deep Sea (supporting)
confidence: 84%
“…Similarly to UCB-based methods, Thompson sampling is guaranteed to converge to an optimal policy in multiarmed bandit problems [38,39], and has shown strong empirical performance [40,41]. For a discussion of known shortcomings of Thompson sampling, we refer to [42][43][44].…”
Section: Related Work (mentioning)
confidence: 99%
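The statement above concerns Thompson sampling's convergence guarantees and empirical strength in multi-armed bandits. For context, here is a minimal, self-contained sketch of the standard Beta-Bernoulli version of the algorithm (a generic textbook formulation, not code from any of the cited papers): maintain a Beta posterior per arm, draw one sample of each arm's mean, and play the arm with the highest sample.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Standard Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas = np.ones(k)   # posterior Beta alpha parameters (successes + 1)
    betas = np.ones(k)    # posterior Beta beta parameters (failures + 1)
    cumulative_regret = 0.0
    best_mean = max(true_means)
    for _ in range(horizon):
        theta = rng.beta(alphas, betas)      # one posterior sample per arm
        arm = int(np.argmax(theta))          # play the sampled-best arm
        reward = rng.binomial(1, true_means[arm])
        alphas[arm] += reward
        betas[arm] += 1 - reward
        cumulative_regret += best_mean - true_means[arm]
    return cumulative_regret

# Example: three arms whose top two means are hard to distinguish.
print(thompson_sampling_bernoulli([0.10, 0.50, 0.52], horizon=10_000))
```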
“…This issue is discussed further in [49]. That paper proposes and analyzes satisficing Thompson sampling, a variant of Thompson sampling that is designed to minimize the exploration costs required to identify an action that is sufficiently close to optimal.…”
Section: Limitations of Thompson Sampling (mentioning)
confidence: 99%
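Satisficing Thompson sampling, as described in [49], targets an action that is sufficiently close to optimal rather than the exact optimum, in order to cut exploration costs. The sketch below is one plausible, heavily simplified reading of that idea, assuming Beta-Bernoulli arms and a tolerance epsilon: sample from the posterior as usual, but play the lowest-indexed arm whose sampled mean is within epsilon of the sampled maximum. The exact satisficing rule analyzed in [49] may differ; this is only an illustration.

```python
import numpy as np

def satisficing_thompson_sampling(true_means, horizon, epsilon=0.05, seed=0):
    """Illustrative satisficing variant (simplified reading of [49], not the exact rule).

    Among arms whose posterior sample is within `epsilon` of the sampled optimum,
    play the lowest-indexed one, so the agent settles for a "good enough" arm
    instead of paying to identify the exact optimum.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas, betas = np.ones(k), np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        theta = rng.beta(alphas, betas)
        near_optimal = np.flatnonzero(theta >= theta.max() - epsilon)
        arm = int(near_optimal[0])           # lowest-indexed satisficing arm
        reward = rng.binomial(1, true_means[arm])
        alphas[arm] += reward
        betas[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# With arms 0.50 and 0.52 and epsilon = 0.05, the agent settles quickly rather
# than spending many pulls distinguishing the top two arms.
print(satisficing_thompson_sampling([0.10, 0.50, 0.52], horizon=10_000))
```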
“…Thompson Sampling (or UCB approaches) would never select such actions, even if they are worth their cost (Russo & Van Roy, 2014). In addition, Thompson Sampling does not take into account the time horizon at which the process ends; if the horizon is known, exploration efforts should be tuned accordingly (Russo et al, 2017). Nonetheless, under the assumption that very accurate posterior approximations lead to efficient decisions, the question is: what happens when the approximations are not so accurate?…”
mentioning
confidence: 99%
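To make the horizon point concrete, here is a small, hedged illustration (a toy heuristic of my own, not a method from the cited works): if the terminal time is known, an agent can simply stop exploring near the end of the horizon, since information gained that late has too little remaining time to pay off.

```python
import numpy as np

def horizon_aware_thompson_sampling(true_means, horizon, greedy_tail=0.1, seed=0):
    """Toy heuristic (not from the cited works): Thompson sampling that turns greedy
    during the last `greedy_tail` fraction of a *known* horizon, since exploration
    that late can no longer pay for itself."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas, betas = np.ones(k), np.ones(k)
    total_reward = 0.0
    for t in range(horizon):
        if horizon - t <= greedy_tail * horizon:
            # Few rounds remain: exploit the current posterior means.
            arm = int(np.argmax(alphas / (alphas + betas)))
        else:
            # Plenty of rounds remain: explore via posterior sampling.
            arm = int(np.argmax(rng.beta(alphas, betas)))
        reward = rng.binomial(1, true_means[arm])
        alphas[arm] += reward
        betas[arm] += 1 - reward
        total_reward += reward
    return total_reward

print(horizon_aware_thompson_sampling([0.10, 0.50, 0.52], horizon=200))
```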