2018
DOI: 10.3390/e20030155
An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits

Abstract: In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in …
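The criterion described in the abstract trades policy information against expected reward through a Lagrange-multiplier-like parameter: a low weight on reward spreads the policy over all arms (exploration-dominant), while a high weight concentrates it on the empirically best arm (exploitation-dominant). The sketch below is a minimal illustration of that idea for a Bernoulli bandit, using a soft-max (Gibbs) policy with an annealed inverse temperature; the function name, annealing schedule, and parameter values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def voi_style_bandit(true_means, n_pulls=5000, beta0=0.1, beta_growth=1.002):
    """Soft-max (Gibbs) exploration over empirical means for a Bernoulli bandit.

    The inverse temperature `beta` plays the role of the multiplier that trades
    policy information against expected reward: a small beta spreads probability
    mass over all arms (exploration-dominant), a large beta concentrates it on
    the empirically best arm (exploitation-dominant). The annealing schedule
    below is an illustrative assumption.
    """
    n_arms = len(true_means)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    beta = beta0
    total_reward = 0.0

    for _ in range(n_pulls):
        # Gibbs policy over the current empirical mean rewards.
        logits = beta * means
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        arm = rng.choice(n_arms, p=probs)

        # Pull the arm, observe a Bernoulli reward, update the running mean.
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total_reward += reward

        beta *= beta_growth  # slowly shift from exploration to exploitation

    return total_reward, means

reward, estimates = voi_style_bandit(np.array([0.2, 0.5, 0.7]))
print(reward, estimates)
```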

Cited by 8 publications (7 citation statements) | References: 29 publications
“…If a causal model specifying how actions affect outcome probabilities (e.g., reward and state transition probabilities in an MDP) is not available, a key new principle is needed to guide decision-making. When causal models and optimal policies are initially uncertain, actions are valued not only for the rewards and state transitions that they cause, but also for the value of the information that they reveal about how to improve policies [52]. Managing the famous exploration–exploitation tradeoff between applying the most promising policy discovered so far (exploitation), and deviating from it to discover whether a different policy might perform better (exploration) requires taking into account the value of information (VoI) produced by actions (ibid.).…”
Section: Structure of Explanations for Decision Recommendations Based on Reinforcement Learning (RL) with Initially Unknown or Uncertain
Mentioning (confidence: 99%)
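The statement above frames action selection as valuing both the reward an action earns and the information it reveals about how to improve the policy. Purely as an illustration of that additive view, and not the cited work's formulation, one can score a Bernoulli arm by its posterior mean reward plus a weighted expected reduction in posterior entropy from one more pull:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def information_adjusted_value(a, b, weight=1.0):
    """Score an arm by expected reward plus expected information gain.

    `a`, `b` are Beta posterior parameters for a Bernoulli arm. The
    information gain is the expected reduction in the posterior's
    (differential) entropy from one more pull. The additive form and the
    `weight` hyperparameter are illustrative assumptions.
    """
    p = a / (a + b)                      # posterior mean reward
    h_now = beta_dist(a, b).entropy()
    h_next = p * beta_dist(a + 1, b).entropy() + (1 - p) * beta_dist(a, b + 1).entropy()
    info_gain = h_now - h_next           # expected entropy reduction
    return p + weight * info_gain

# A well-sampled arm and a barely-sampled arm with the same mean:
print(information_adjusted_value(50, 50))   # low information gain
print(information_adjusted_value(2, 2))     # higher information gain
```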
“…Further advantages of our approach are provided in an associated online supplement. We also encourage readers to consult [40,41] for more comparisons of the value of information with existing reinforcement learning strategies.…”
Section: Introduction
Mentioning (confidence: 99%)
“…Convergence for many of these tree-search procedures is, currently, only guaranteed for simplified problem domains and for certain exploration strategies [30]. Their theoretical performance, in terms of a criterion known as regret, is not optimal in these domains, unlike the value of information [40].…”
Section: Introduction
Mentioning (confidence: 99%)
“…The problem of finding an aggregated Markov chain that captures much of the dynamics in the original chain can be posed as a cost function that uses the above divergence. For the second process, we consider the use of an information-theoretic criterion known as the value of information [13,14,15] to efficiently segment the probability transition graph. It provides a partition matrix as the optimal solution of minimizing the criterion with respect to the partition assignments, where the objective involves a marginal probability and a hyperparameter.…”
Section: Introduction
Mentioning (confidence: 99%)
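The last statement applies the value of information to aggregate a Markov chain by partitioning its transition graph. A minimal sketch of one way such a criterion can be optimized is the alternating, deterministic-annealing-style update below, which softly assigns states to groups by trading the KL divergence between each state's transition row and a group centroid against the sharpness of the assignment; the objective, update rule, and hyperparameter are illustrative assumptions rather than the cited paper's exact formulation.

```python
import numpy as np

def aggregate_markov_chain(P, n_groups, beta=5.0, n_iters=200, seed=0):
    """Softly partition the states of a Markov chain with transition matrix P.

    Alternates between (1) recomputing group centroid distributions as
    weighted mixtures of the transition rows assigned to them and
    (2) reassigning states with a Gibbs rule that penalizes the KL divergence
    between a state's transition row and each centroid. `beta` trades
    assignment sharpness against distortion (illustrative hyperparameter).
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)                       # uniform state weights
    q = rng.dirichlet(np.ones(n_groups), size=n)   # soft assignments q(c|x)
    eps = 1e-12

    for _ in range(n_iters):
        # Centroids: m_c proportional to sum_x pi(x) q(c|x) P(x, .).
        weights = pi[:, None] * q                  # shape (n, n_groups)
        m = weights.T @ P
        m /= m.sum(axis=1, keepdims=True) + eps

        # KL(P(x, .) || m_c) for every state x and group c.
        kl = np.array([[np.sum(P[x] * (np.log(P[x] + eps) - np.log(m[c] + eps)))
                        for c in range(n_groups)] for x in range(n)])

        # Gibbs reassignment weighted by the current group priors.
        prior = weights.sum(axis=0) + eps
        logits = np.log(prior)[None, :] - beta * kl
        q = np.exp(logits - logits.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)

    return q, m

# Example: a chain with two weakly coupled two-state blocks.
P = np.array([[0.60, 0.30, 0.05, 0.05],
              [0.40, 0.50, 0.05, 0.05],
              [0.05, 0.05, 0.70, 0.20],
              [0.05, 0.05, 0.30, 0.60]])
q, m = aggregate_markov_chain(P, n_groups=2)
print(np.round(q, 2))
```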