2012
DOI: 10.1007/978-3-642-34106-9_18

Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis

Abstract: The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, a comparison that has been lacking in the literature for the Bernoulli case until now.
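
For reference, the Lai and Robbins lower bound mentioned in the abstract can be stated as follows for Bernoulli rewards (the standard form; the notation here is ours, not quoted from the paper):

% Lai & Robbins (1985) asymptotic lower bound, Bernoulli case.
% \mu_a: mean of arm a; \mu^*: mean of the best arm; R_T: cumulative regret.
\[
\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\log T}
\;\ge\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathrm{kl}(\mu_a, \mu^*)},
\qquad
\mathrm{kl}(p, q) \;=\; p \log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}.
\]

The paper's contribution is a finite-time regret bound for Thompson Sampling whose leading term matches this asymptotic rate.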

Cited by 342 publications (375 citation statements)
References 8 publications (7 reference statements)
“…Both UCB, kl-UCB and TS have been proved to have logarithmic regrets [14], [15], [16], meaning that R…”
Section: Classical MAB Algorithms: UCB, kl-UCB, TS
Mentioning confidence: 99%
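
The truncated "R…" in this excerpt presumably refers to the cumulative regret; in standard notation (ours, not the citing paper's), logarithmic regret means

\[
R_T \;=\; T\mu^* \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} X_t\right] \;=\; O(\log T),
\]

where $\mu^*$ is the mean of the best arm and $X_t$ is the reward received at round $t$.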
“…The theoretical analysis arguing that TS is a viable solution method to MAB problems is an active area of research (Kaufmann et al. 2012; Ortega and Braun 2010; Russo and Van Roy 2014). While it may seem like a simple heuristic, TS has been shown to be an optimal policy with respect to minimizing finite-time regret (Agrawal and Goyal 2012; Kaufmann et al. 2012), minimizing relative entropy (Ortega and Braun 2013), and min- […] The TS allocation rule encodes model uncertainty by drawing samples from the posterior.…”
Section: Optimization Problem
Mentioning confidence: 99%
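
The posterior-sampling allocation rule described in this excerpt is short enough to sketch. Below is a minimal illustration for Bernoulli bandits with Beta(1,1) priors (our sketch, not code from any of the cited papers; the two arm means are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

def ts_bernoulli_step(success, failure):
    # TS allocation rule: draw one sample from each arm's Beta posterior
    # (Beta(1 + successes, 1 + failures) under a uniform prior) and play
    # the arm whose sample is largest.
    samples = rng.beta(success + 1, failure + 1)
    return int(np.argmax(samples))

# Tiny simulation on a two-armed bandit with hypothetical means 0.5 and 0.6.
means = np.array([0.5, 0.6])
success, failure = np.zeros(2), np.zeros(2)
for t in range(10_000):
    a = ts_bernoulli_step(success, failure)
    r = float(rng.random() < means[a])
    success[a] += r
    failure[a] += 1.0 - r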
“…In the case of the standard SMAB problem, bandit algorithms following the Bayesian approach, namely Bayesian-UCB [19] and Thompson sampling [20], have stronger, theoretically grounded guarantees and have achieved better performance in experiments on other tasks [19,20] than pointwise alternatives such as UCB-1. However, to the best of our knowledge, they have not been applied to the OLREE problem before.…”
Section: Bayesian Bandits
Mentioning confidence: 99%
“…In the case of $\alpha_{\mathrm{low}}(t) = \alpha_{\mathrm{up}}(t)$, this algorithm coincides with Bayesian-UCB [19], and if $\alpha_{\mathrm{low}}(t) = 0$, $\alpha_{\mathrm{up}}(t) = 1$, it reduces to Thompson sampling [20]. Note that we can store $\{\gamma_{a,t}, W_{a,t}, \{p_{a,0}(r)\}_{r \in [0,1]}\}$ instead of $\{p_{a,t}(r)\}_{r \in [0,1]}$ and update only $\gamma_{a,t}$ and $W_{a,t}$ when we run Bayesian bandits in practice.…”
Section: Bayesian Bandits
Mentioning confidence: 99%
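
Reading the quoted statement literally, the interpolation between Bayesian-UCB and Thompson sampling can be sketched as a posterior-quantile index with a randomized quantile level. This is our reconstruction under that assumption; the function and variable names are hypothetical, and the exact algorithm is the one given in the citing paper:

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

def quantile_index_step(success, failure, alpha_low, alpha_up):
    # For each arm, evaluate the Beta posterior's inverse CDF at a level
    # drawn uniformly from [alpha_low, alpha_up], then play the argmax.
    # alpha_low == alpha_up gives a fixed-quantile (Bayes-UCB-style) index;
    # alpha_low = 0, alpha_up = 1 recovers Thompson sampling, since a
    # posterior sample equals the inverse CDF at a Uniform(0, 1) draw.
    u = rng.uniform(alpha_low, alpha_up, size=len(success))
    indices = beta.ppf(u, success + 1, failure + 1)
    return int(np.argmax(indices))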