2018
DOI: 10.3390/e20030155
An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits

Abstract: In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in …
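The criterion described in the abstract trades policy information against expected reward through a Lagrange-multiplier-like parameter: a low weight on reward spreads the policy over all arms (exploration-dominant), while a high weight concentrates it on the empirically best arm (exploitation-dominant). The sketch below is a minimal illustration of that idea for a Bernoulli bandit, using a soft-max (Gibbs) policy with an annealed inverse temperature; the function name, annealing schedule, and parameter values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def voi_style_bandit(true_means, n_pulls=5000, beta0=0.1, beta_growth=1.002):
    """Soft-max (Gibbs) exploration over empirical means for a Bernoulli bandit.

    The inverse temperature `beta` plays the role of the multiplier that trades
    policy information against expected reward: a small beta spreads probability
    mass over all arms (exploration-dominant), a large beta concentrates it on
    the empirically best arm (exploitation-dominant). The annealing schedule
    below is an illustrative assumption.
    """
    n_arms = len(true_means)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    beta = beta0
    total_reward = 0.0

    for _ in range(n_pulls):
        # Gibbs policy over the current empirical mean rewards.
        logits = beta * means
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        arm = rng.choice(n_arms, p=probs)

        # Pull the arm, observe a Bernoulli reward, update the running mean.
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total_reward += reward

        beta *= beta_growth  # slowly shift from exploration to exploitation

    return total_reward, means

reward, estimates = voi_style_bandit(np.array([0.2, 0.5, 0.7]))
print(reward, estimates)
```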

Cited by 8 publications (7 citation statements) | References: 29 publications
“…If a causal model specifying how actions affect outcome probabilities (e.g., reward and state transition probabilities in an MDP) is not available, a key new principle is needed to guide decision-making. When causal models and optimal policies are initially uncertain, actions are valued not only for the rewards and state transitions that they cause, but also for the value of the information that they reveal about how to improve policies [52]. Managing the famous exploration–exploitation tradeoff between applying the most promising policy discovered so far (exploitation), and deviating from it to discover whether a different policy might perform better (exploration) requires taking into account the value of information (VoI) produced by actions (ibid.).…”
Section: Structure of Explanations for Decision Recommendations Based on Reinforcement Learning (RL) with Initially Unknown or Uncertain
Mentioning (confidence: 99%)
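The statement above frames action selection as valuing both the reward an action earns and the information it reveals about how to improve the policy. Purely as an illustration of that additive view, and not the cited work's formulation, one can score a Bernoulli arm by its posterior mean reward plus a weighted expected reduction in posterior entropy from one more pull:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def information_adjusted_value(a, b, weight=1.0):
    """Score an arm by expected reward plus expected information gain.

    `a`, `b` are Beta posterior parameters for a Bernoulli arm. The
    information gain is the expected reduction in the posterior's
    (differential) entropy from one more pull. The additive form and the
    `weight` hyperparameter are illustrative assumptions.
    """
    p = a / (a + b)                      # posterior mean reward
    h_now = beta_dist(a, b).entropy()
    h_next = p * beta_dist(a + 1, b).entropy() + (1 - p) * beta_dist(a, b + 1).entropy()
    info_gain = h_now - h_next           # expected entropy reduction
    return p + weight * info_gain

# A well-sampled arm and a barely-sampled arm with the same mean:
print(information_adjusted_value(50, 50))   # low information gain
print(information_adjusted_value(2, 2))     # higher information gain
```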
“…Further advantages of our approach are provided in an associated online supplement. We also encourage readers to consult [40,41] for more comparisons of the value of information with existing reinforcement learning strategies.…”
Section: Introduction
Mentioning (confidence: 99%)
“…Convergence for many of these tree-search procedures is, currently, only guaranteed for simplified problem domains and for certain exploration strategies [30]. Their theoretical performance, in terms of a criterion known as regret, is not optimal in these domains, unlike the value of information [40].…”
Section: Introduction
Mentioning (confidence: 99%)
“…The problem of finding an aggregated Markov chain that captures much of the dynamics in the original chain can be posed as a cost function that uses the above divergence. For the second process, we consider the use of an information-theoretic criterion known as the value of information [13,14,15] to efficiently segment the probability transition graph. It provides a partition matrix as the optimal solution of minimizing the criterion with respect to the partition assignments, where the objective involves a marginal probability and a hyperparameter.…”
Section: Introduction
Mentioning (confidence: 99%)
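The last statement applies the value of information to aggregate a Markov chain by partitioning its transition graph. A minimal sketch of one way such a criterion can be optimized is the alternating, deterministic-annealing-style update below, which softly assigns states to groups by trading the KL divergence between each state's transition row and a group centroid against the sharpness of the assignment; the objective, update rule, and hyperparameter are illustrative assumptions rather than the cited paper's exact formulation.

```python
import numpy as np

def aggregate_markov_chain(P, n_groups, beta=5.0, n_iters=200, seed=0):
    """Softly partition the states of a Markov chain with transition matrix P.

    Alternates between (1) recomputing group centroid distributions as
    weighted mixtures of the transition rows assigned to them and
    (2) reassigning states with a Gibbs rule that penalizes the KL divergence
    between a state's transition row and each centroid. `beta` trades
    assignment sharpness against distortion (illustrative hyperparameter).
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)                       # uniform state weights
    q = rng.dirichlet(np.ones(n_groups), size=n)   # soft assignments q(c|x)
    eps = 1e-12

    for _ in range(n_iters):
        # Centroids: m_c proportional to sum_x pi(x) q(c|x) P(x, .).
        weights = pi[:, None] * q                  # shape (n, n_groups)
        m = weights.T @ P
        m /= m.sum(axis=1, keepdims=True) + eps

        # KL(P(x, .) || m_c) for every state x and group c.
        kl = np.array([[np.sum(P[x] * (np.log(P[x] + eps) - np.log(m[c] + eps)))
                        for c in range(n_groups)] for x in range(n)])

        # Gibbs reassignment weighted by the current group priors.
        prior = weights.sum(axis=0) + eps
        logits = np.log(prior)[None, :] - beta * kl
        q = np.exp(logits - logits.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)

    return q, m

# Example: a chain with two weakly coupled two-state blocks.
P = np.array([[0.60, 0.30, 0.05, 0.05],
              [0.40, 0.50, 0.05, 0.05],
              [0.05, 0.05, 0.70, 0.20],
              [0.05, 0.05, 0.30, 0.60]])
q, m = aggregate_markov_chain(P, n_groups=2)
print(np.round(q, 2))
```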