Wiley Encyclopedia of Operations Research and Management Science 2011
DOI: 10.1002/9780470400531.eorms0444

The Knowledge Gradient for Optimal Learning

Abstract: Optimal learning addresses the problem of how to collect information so that it benefits future decisions. For off‐line problems, we have to make a series of measurements or observations before choosing a final design or set of parameters; for on‐line problems, we learn from rewards we are receiving, and we want to strike a balance between rewards earned now and better decisions in the future. This article reviews these problems, describes optimal and heuristic policies, and shows how to compare competing policies…

Cited by 21 publications (19 citation statements); references 31 publications.

“…Multiple proposals have been widely considered in the multi-armed bandits (MAB) literature for these heuristics, ranging from early examples like the Gittins index for infinite horizon problems (Gittins and Jones 1974) to more recent methods such as the knowledge gradient (Powell 2010). Here we describe several approximate policy selection mechanisms that we use for dealing with the policy reuse problem.…”
Section: Belief Over Types
Mentioning (confidence: 99%)
“…The final exploration heuristic we describe is the knowledge gradient (Powell 2010), which aims to balance exploration and exploitation through optimising myopic return whilst maintaining asymptotic optimality. The principle behind this approach is to estimate a one step look-ahead, and select the policy which maximises utility over both the current time step and the next in terms of the information gained.…”
Section: Knowledge Gradient
Mentioning (confidence: 99%)
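
The one-step look-ahead described in the statement above can be made concrete. Below is a minimal sketch (not taken from the article or the citing paper) of the knowledge-gradient score for independent normal beliefs with known measurement noise, following the standard form ν_x = σ̃_x (ζ_x Φ(ζ_x) + φ(ζ_x)); the function name and arguments are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma2, noise_var):
    """Knowledge-gradient score of each alternative under independent
    normal beliefs (posterior means mu, variances sigma2) and a known
    measurement-noise variance. A higher score means one more measurement
    of that alternative is expected to improve the final choice the most."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    # Standard deviation of the change in the posterior mean caused by
    # one additional noisy measurement of alternative x.
    sigma_tilde = sigma2 / np.sqrt(sigma2 + noise_var)

    # Normalized distance of each alternative to the best of the others.
    best_other = np.array([np.delete(mu, x).max() for x in range(mu.size)])
    zeta = -np.abs(mu - best_other) / sigma_tilde

    # f(z) = z * Phi(z) + phi(z): expected improvement of a standard normal.
    return sigma_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
```
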
“…Some of the most commonly used policies that have succeeded in practical applications are pure exploration, pure exploitation, ε-greedy-like, Boltzmann/softmax-like, and interval estimation (Powell, 2010). These policies try to balance exploration and exploitation by focusing on minimizing regret (i.e., a safer-exploration or risk-averse approach), without paying much attention to the value of the information to be obtained.…”
Section: Contextual Bandits: a Linear Bayes' Methods
Mentioning (confidence: 97%)
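
For contrast with the value-of-information view, here is a hedged sketch of the regret-oriented policies named in the statement above (ε-greedy, Boltzmann/softmax, and interval estimation). The function names and default parameters are illustrative assumptions, not definitions from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(mu, eps=0.1):
    """Exploit the current best estimate, but explore uniformly with probability eps."""
    return int(rng.integers(len(mu))) if rng.random() < eps else int(np.argmax(mu))

def boltzmann(mu, temperature=1.0):
    """Sample an arm with probability proportional to exp(estimate / temperature)."""
    p = np.exp((np.asarray(mu) - np.max(mu)) / temperature)
    return int(rng.choice(len(mu), p=p / p.sum()))

def interval_estimation(mu, sigma, z=2.0):
    """Choose the arm with the largest optimistic bound mu + z * sigma."""
    return int(np.argmax(np.asarray(mu) + z * np.asarray(sigma)))
```
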
“…A totally different kind of approach is to explicitly choose an option to gain as much information as possible from every exploratory trial. This approach has been studied in depth by Frazier, Powell, and Dayanik (2008) and Powell (2010) as the knowledge gradient (KG) method. The KG is an explicit method to maximize the marginal value of the information obtained by exploring an alternative, that is, it ‘identifies the measurement which will do the most to identify the best choice’ (Powell & Ryzhov, 2012).…”
Section: Contextual Bandits: a Linear Bayes' Methods
Mentioning (confidence: 99%)
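
As a usage illustration of the earlier sketch (the numbers are made up): the alternative whose measurement is worth the most is the one with the highest KG score, which is often a poorly known alternative rather than the one with the highest estimated mean.

```python
mu = [1.0, 1.2, 0.8]          # current estimates of three alternatives
sigma2 = [0.05, 0.05, 1.0]    # the first two are well known, the third is not
scores = knowledge_gradient(mu, sigma2, noise_var=0.25)
next_to_measure = int(np.argmax(scores))   # likely the poorly known alternative
```
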
“…This trade-off between learning and instant optimization is also frequently observed in dynamic pricing problems. The literature on multi-armed bandit problems is large; some key references are Thompson (1933), Robbins (1952), Lai and Robbins (1985), Gittins (1989), and Auer et al. (2002); see further Vermorel and Mohri (2005), Cesa-Bianchi and Lugosi (2006), and Powell (2010). If in a dynamic pricing problem the number of admissible selling prices is finite, the problem can be modeled as a classical multi-armed bandit problem.…”
Section: Methodologically Related Areas
Mentioning (confidence: 99%)