Wiley Encyclopedia of Operations Research and Management Science 2011
DOI: 10.1002/9780470400531.eorms0444

The Knowledge Gradient for Optimal Learning

Abstract: Optimal learning addresses the problem of how to collect information so that it benefits future decisions. For off‐line problems, we have to make a series of measurements or observations before choosing a final design or set of parameters; for on‐line problems, we learn from rewards we are receiving, and we want to strike a balance between rewards earned now and better decisions in the future. This article reviews these problems, describes optimal and heuristic policies, and shows how to compare competing policies…

Cited by 21 publications (19 citation statements); references 31 publications.

“…Multiple proposals have been widely considered in the multi-armed bandits (MAB) literature for these heuristics, ranging from early examples like the Gittins index for infinite horizon problems (Gittins and Jones 1974) to more recent methods such as the knowledge gradient (Powell 2010). Here we describe several approximate policy selection mechanisms that we use for dealing with the policy reuse problem.…”
Section: Belief Over Types
Mentioning (confidence: 99%)
“…The final exploration heuristic we describe is the knowledge gradient (Powell 2010), which aims to balance exploration and exploitation through optimising myopic return whilst maintaining asymptotic optimality. The principle behind this approach is to estimate a one step look-ahead, and select the policy which maximises utility over both the current time step and the next in terms of the information gained.…”
Section: Knowledge Gradient
Mentioning (confidence: 99%)
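
The one-step look-ahead described in the statement above can be made concrete. Below is a minimal sketch (not taken from the article or the citing paper) of the knowledge-gradient score for independent normal beliefs with known measurement noise, following the standard form ν_x = σ̃_x (ζ_x Φ(ζ_x) + φ(ζ_x)); the function name and arguments are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma2, noise_var):
    """Knowledge-gradient score of each alternative under independent
    normal beliefs (posterior means mu, variances sigma2) and a known
    measurement-noise variance. A higher score means one more measurement
    of that alternative is expected to improve the final choice the most."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    # Standard deviation of the change in the posterior mean caused by
    # one additional noisy measurement of alternative x.
    sigma_tilde = sigma2 / np.sqrt(sigma2 + noise_var)

    # Normalized distance of each alternative to the best of the others.
    best_other = np.array([np.delete(mu, x).max() for x in range(mu.size)])
    zeta = -np.abs(mu - best_other) / sigma_tilde

    # f(z) = z * Phi(z) + phi(z): expected improvement of a standard normal.
    return sigma_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
```
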
“…Some of the most commonly used policies that have succeeded in practical applications are pure exploration, pure exploitation, ε-greedy-like, Boltzmann/softmax-like, and interval estimation (Powell, 2010). These policies try to balance exploration and exploitation by focusing on minimizing regret (i.e., a safer-exploration or risk-averse approach), without paying much attention to the value of the information to be obtained.…”
Section: Contextual Bandits: a Linear Bayes' Methods
Mentioning (confidence: 97%)
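
For contrast with the value-of-information view, here is a hedged sketch of the regret-oriented policies named in the statement above (ε-greedy, Boltzmann/softmax, and interval estimation). The function names and default parameters are illustrative assumptions, not definitions from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(mu, eps=0.1):
    """Exploit the current best estimate, but explore uniformly with probability eps."""
    return int(rng.integers(len(mu))) if rng.random() < eps else int(np.argmax(mu))

def boltzmann(mu, temperature=1.0):
    """Sample an arm with probability proportional to exp(estimate / temperature)."""
    p = np.exp((np.asarray(mu) - np.max(mu)) / temperature)
    return int(rng.choice(len(mu), p=p / p.sum()))

def interval_estimation(mu, sigma, z=2.0):
    """Choose the arm with the largest optimistic bound mu + z * sigma."""
    return int(np.argmax(np.asarray(mu) + z * np.asarray(sigma)))
```
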
“…A totally different kind of approach is to explicitly choose an option to gain as much information as possible from every exploratory trial. This approach has been studied in depth by Frazier, Powell, and Dayanik (2008) and Powell (2010) as the knowledge gradient (KG) method. The KG is an explicit method to maximize the marginal value of the information obtained by exploring an alternative, that is, it ‘identifies the measurement which will do the most to identify the best choice’ (Powell & Ryzhov, 2012).…”
Section: Contextual Bandits: a Linear Bayes' Methods
Mentioning (confidence: 99%)
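
As a usage illustration of the earlier sketch (the numbers are made up): the alternative whose measurement is worth the most is the one with the highest KG score, which is often a poorly known alternative rather than the one with the highest estimated mean.

```python
mu = [1.0, 1.2, 0.8]          # current estimates of three alternatives
sigma2 = [0.05, 0.05, 1.0]    # the first two are well known, the third is not
scores = knowledge_gradient(mu, sigma2, noise_var=0.25)
next_to_measure = int(np.argmax(scores))   # likely the poorly known alternative
```
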
“…This trade-off between learning and instant optimization is also frequently observed in dynamic pricing problems. The literature on multi-armed bandit problems is large; some key references are Thompson (1933), Robbins (1952), Lai and Robbins (1985), Gittins (1989), and Auer et al. (2002); see further Vermorel and Mohri (2005), Cesa-Bianchi and Lugosi (2006), and Powell (2010). If in a dynamic pricing problem the number of admissible selling prices is finite, the problem can be modeled as a classical multi-armed bandit problem.…”
Section: Methodologically Related Areas
Mentioning (confidence: 99%)