In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value-of-information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during the search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to regret that is logarithmic in the number of arm pulls.

Alternatively, the gambler can repeatedly play the machine that he or she currently perceives to be the best. This is referred to as exploitation, since the gambler is leveraging his or her existing knowledge about the pay-off statistics to choose an appropriate machine.

A reasonable balance between exploration and exploitation is needed, even for this simple environment. Without the ability to explore, the gambler may fail to discover that one slot machine has a higher average return than the others. Without the ability to exploit, the gambler may fail to pull the best arm often enough to obtain a high pay-off.

Many methods have been developed that attempt to explore optimally in the discrete, stochastic multi-armed-bandit abstraction. One way to quantify their success is through regret (see Appendix A). Regret is the expected total loss in rewards incurred by not pulling the optimal arm at each episode; a standard formalization is given at the end of this section. Lai and Robbins proved that a gambler's regret can be bounded below by a term that is logarithmic in the number of pulls [5]. There is no gambling strategy with better asymptotic performance.

There are approaches available that achieve logarithmic regret for stochastic multi-armed bandits. Prominent examples include the upper-confidence-bound method [6,7] and its many extensions [8,9], Thompson sampling [2,10], and the minimum-empirical-divergence algorithm [11,12]. The stochastic epsilon-greedy [13] and the stochastic exponential-weight explore-or-exploit algorithms [13,14] can also obtain logarithmic regret. Most of these approaches adopt reasonable assumptions about the rewards: they assume that the supplied rewards are independent random variables with stationary means.

In this paper, we propose a stochastic exploration tactic that, for this abstraction, relies on a weaker assumption about the reward distributions than some bandit algorithms. That is, we assume that the distribution of each new random reward can depend, in an adversarial way, on the previous pulls and observed rewards, provided that the reward mean is fixed. Our approach is based on the notion of the value of information due to Stratonovich [15,16], which implements an information-theoretic optimization criterion; an illustrative exploration sketch in this spirit appears at the end of this section. We have previously applied the value of information...
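As referenced above, the regret notion can be made precise as follows. This is the standard formulation from the bandit literature; the notation here is ours and may differ from that used in Appendix A of the paper.

```latex
% Standard cumulative-regret formulation for a K-armed bandit:
% arm a has fixed mean reward \mu_a, \mu^* = \max_a \mu_a, and a_t
% denotes the arm pulled at episode t.
\[
  R(T) \,=\, T \mu^{*} \,-\, \mathbb{E}\!\left[ \sum_{t=1}^{T} \mu_{a_t} \right].
\]
% Lai--Robbins lower bound [5]: for any consistent strategy,
\[
  \liminf_{T \to \infty} \frac{R(T)}{\ln T}
  \,\geq\, \sum_{a \,:\, \mu_a < \mu^{*}} \frac{\mu^{*} - \mu_a}{D(p_a \,\|\, p^{*})},
\]
% where D(p_a || p^*) is the Kullback--Leibler divergence between the
% reward distribution of arm a and that of an optimal arm. Regret
% therefore grows at least logarithmically in the number of pulls T.
```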
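To illustrate the simulated-annealing-like parameter update described in the abstract, the following sketch implements a generic annealed soft-max (Boltzmann) exploration rule. The Gibbs policy form, the `tau0 / log(t + 1)` cooling schedule, and the function names are assumptions made for illustration only; the paper's actual value-of-information policy, and the cooling schedule it proves sufficient for logarithmic regret, are not reproduced in this excerpt.

```python
import math
import random

def annealed_softmax_bandit(pull, n_arms, horizon, tau0=1.0):
    """Generic annealed soft-max (Boltzmann) exploration sketch.

    `pull(a)` returns a stochastic reward for arm `a`. The Gibbs policy
    form and the cooling schedule below are illustrative assumptions,
    not the paper's value-of-information derivation.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        # Simulated-annealing-like update: the exploration parameter
        # (temperature) shrinks over episodes, moving the policy from
        # exploration-dominant toward exploitation-dominant behavior.
        tau = tau0 / math.log(t + 1.0)
        # Gibbs/soft-max distribution over the empirical mean rewards;
        # subtracting the maximum keeps the exponentials stable.
        m_max = max(means)
        weights = [math.exp((m - m_max) / tau) for m in means]
        arm = random.choices(range(n_arms), weights=weights)[0]
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
        total_reward += reward
    return total_reward, means
```

For example, `annealed_softmax_bandit(lambda a: random.gauss(0.5 + 0.1 * a, 1.0), n_arms=3, horizon=10000)` would simulate a small three-armed instance with Gaussian rewards. Whether a given schedule cools "sufficiently fast" in the paper's sense is precisely what its regret analysis establishes; the logarithmic decay above is only a common default.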