A Structured Multiarmed Bandit Problem and the Greedy Policy

Mersereau, Adam J.; Rusmevichientong, Paat; Tsitsiklis, John N.

doi:10.1109/tac.2009.2031725

Cited by 71 publications

(65 citation statements)

References 31 publications

(39 reference statements)


“…Finally, we suggest also to extend our policy learning scheme to other, and more complex, exploration-exploitation problems than the one tackled in this paper, such as for example bandit problems where the arms are not statistically independent (Mersereau et al, 2009) or general Markov Decision processes (Ishii et al, 2002).…”

Section: Results (mentioning)

confidence: 99%

Learning to Play K-Armed Bandit Problems

Maes

Wehenkel

Ernst

2012

Proceedings of the 4th International Conference on Agents and Artificial Intelligence


Abstract: We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information describing the class of problems targeted by the player. Our algorithm first samples a set of K-armed bandit problems from the given prior, and then chooses in a space of candidate policies one that gives the best average performance over these problems. The candidate policies use an index for ranking the arms and pick at each play the arm with the highest index; the index for each arm is computed in the form of a linear combination of features describing the history of plays (e.g., number of draws, average reward, variance of rewards and higher order moments), and an estimation of distribution algorithm is used to determine its optimal parameters in the form of feature weights. We carry out simulations in the case where the prior assumes a fixed number of Bernoulli arms, a fixed horizon, and uniformly distributed parameters of the Bernoulli arms. These simulations show that learned strategies perform very well with respect to several other strategies previously proposed in the literature (UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY); they also highlight the robustness of these strategies with respect to wrong prior information.
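The procedure described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three features in `features`, the function names, and the candidate weight vectors are all hypothetical choices made for illustration, and a plain comparison over a short candidate list stands in for the estimation of distribution algorithm mentioned in the abstract.

```python
import math
import random


def features(n_draws, mean_reward, t):
    """Hypothetical per-arm history features (an illustrative choice only)."""
    return [mean_reward, math.sqrt(math.log(t) / n_draws), 1.0 / n_draws]


def play(weights, arm_probs, horizon, rng):
    """Run one linear-index policy on Bernoulli arms; return the total reward."""
    k = len(arm_probs)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once so every feature is defined
        else:
            def index(i):
                f = features(counts[i], sums[i] / counts[i], t)
                return sum(w * x for w, x in zip(weights, f))
            arm = max(range(k), key=index)  # play the arm with the highest index
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total


def sample_problems(n, k, seed=0):
    """Sample n problems with k Bernoulli arms, parameters uniform on [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(k)] for _ in range(n)]


def average_reward(weights, problems, horizon, seed=0):
    """Average total reward of one policy over the sampled problems."""
    rng = random.Random(seed)
    return sum(play(weights, p, horizon, rng) for p in problems) / len(problems)


def choose_policy(candidates, problems, horizon):
    """Pick the candidate weight vector with the best average reward.
    Assumption: exhaustive comparison replaces the paper's optimizer."""
    return max(candidates, key=lambda w: average_reward(w, problems, horizon))


if __name__ == "__main__":
    problems = sample_problems(n=50, k=2)
    candidates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.5, 0.5]]
    print(choose_policy(candidates, problems, horizon=100))
```

The first candidate above corresponds to a purely greedy index on the empirical mean, while the others add exploration-style terms; which one wins depends on the sampled problems and the horizon.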


“…Agrawal (1995) studies a multi-armed bandit in which arms represent points in the real line and their expected rewards are continuous functions of the arms. Mersereau et al (2009) and Rusmevichientong and Tsitsiklis (2010) study bandits with possibly infinite arms when expected rewards are linear functions of an (unknown) scalar and a vector, respectively. Our paper also relates to the literature on linear bandit models (see e.g., Abernethy et al (2008) and Dani et al (2008)) as the model we study is a linear stochastic bandit with a finite (but combinatorial) number of arms.…”

Section: Literature Review (mentioning)

confidence: 99%

Learning in Combinatorial Optimization: What and How to Explore

2016


“…Recent work has proposed an alternative model in which the arms have statistically dependent reward distributions, and therefore pulling one arm also gives information about other arms. In this setting, the correlation between payoffs of different arms allows for faster learning, even when the number of arms is very large (Dani et al 2008, Mersereau et al 2009). …”

Section: Related Work (mentioning)

confidence: 99%

The Value of Field Experiments

Rusmevichientong

Simester

et al. 2015

Management Science


The feasibility of using field experiments to optimize marketing decisions remains relatively unstudied. We investigate category pricing decisions that require estimating a large matrix of cross-product demand elasticities and ask: how many experiments are required as the number of products in the category grows? Our main result demonstrates that if the categories have a favorable structure then we can learn faster and reduce the number of experiments that are required: the number of experiments required may grow just logarithmically with the number of products. These findings potentially have important implications for the application of field experiments. Firms may be able to obtain meaningful estimates using a practically feasible number of experiments, even in categories with a large number of products. We also provide a relatively simple mechanism that firms can use to evaluate whether a category has a structure that makes it feasible to use field experiments to set prices. We illustrate how to accomplish this using either a sample of historical data or a pilot set of experiments. We also discuss how to evaluate whether field experiments can help optimize other marketing decisions, such as selecting which products to advertise or promote.
