In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value-of-information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during the search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to regret that is logarithmic in the number of arm pulls.

Alternatively, the gambler can repeatedly play the machine that he or she currently perceives to be the best. This is referred to as exploitation, since the gambler is leveraging his or her existing knowledge about the pay-off statistics to choose an appropriate machine.

A reasonable balance between exploration and exploitation is needed, even for this simple environment. Without the ability to explore, the gambler may fail to discover that one slot machine has a higher average return than the others. Without the ability to exploit, the gambler may fail to pull the best arm often enough to obtain a high pay-off.

Many methods have been developed that attempt to explore optimally in the discrete, stochastic multi-armed-bandit abstraction. One way to quantify their success is through regret (see Appendix A). Regret is the expected total loss in rewards incurred by not pulling the optimal arm at each episode; a standard formalization is given at the end of this section. Lai and Robbins proved that a gambler's regret can be bounded below by a term that is logarithmic in the number of pulls [5]. There is no gambling strategy with better asymptotic performance.

There are approaches available that achieve logarithmic regret for stochastic multi-armed bandits. Prominent examples include the upper-confidence-bound method [6,7] and its many extensions [8,9], Thompson sampling [2,10], and the minimum-empirical-divergence algorithm [11,12]. The stochastic epsilon-greedy [13] and the stochastic exponential-weight explore-or-exploit algorithms [13,14] can also obtain logarithmic regret. Most of these approaches adopt reasonable assumptions about the rewards: they assume that the supplied rewards are independent random variables with stationary means.

In this paper, we propose a stochastic exploration tactic that, for this abstraction, relies on a weaker assumption about the reward distributions than some bandit algorithms. That is, we assume that the distribution of each new random reward can depend, in an adversarial way, on the previous pulls and observed rewards, provided that the reward mean is fixed. Our approach is based on the notion of the value of information due to Stratonovich [15,16], which implements an information-theoretic optimization criterion; an illustrative exploration sketch in this spirit appears at the end of this section. We have previously applied the value of information...
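As referenced above, the regret notion can be made precise as follows. This is the standard formulation from the bandit literature; the notation here is ours and may differ from that used in Appendix A of the paper.

```latex
% Standard cumulative-regret formulation for a K-armed bandit:
% arm a has fixed mean reward \mu_a, \mu^* = \max_a \mu_a, and a_t
% denotes the arm pulled at episode t.
\[
  R(T) \,=\, T \mu^{*} \,-\, \mathbb{E}\!\left[ \sum_{t=1}^{T} \mu_{a_t} \right].
\]
% Lai--Robbins lower bound [5]: for any consistent strategy,
\[
  \liminf_{T \to \infty} \frac{R(T)}{\ln T}
  \,\geq\, \sum_{a \,:\, \mu_a < \mu^{*}} \frac{\mu^{*} - \mu_a}{D(p_a \,\|\, p^{*})},
\]
% where D(p_a || p^*) is the Kullback--Leibler divergence between the
% reward distribution of arm a and that of an optimal arm. Regret
% therefore grows at least logarithmically in the number of pulls T.
```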
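To illustrate the simulated-annealing-like parameter update described in the abstract, the following sketch implements a generic annealed soft-max (Boltzmann) exploration rule. The Gibbs policy form, the `tau0 / log(t + 1)` cooling schedule, and the function names are assumptions made for illustration only; the paper's actual value-of-information policy, and the cooling schedule it proves sufficient for logarithmic regret, are not reproduced in this excerpt.

```python
import math
import random

def annealed_softmax_bandit(pull, n_arms, horizon, tau0=1.0):
    """Generic annealed soft-max (Boltzmann) exploration sketch.

    `pull(a)` returns a stochastic reward for arm `a`. The Gibbs policy
    form and the cooling schedule below are illustrative assumptions,
    not the paper's value-of-information derivation.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        # Simulated-annealing-like update: the exploration parameter
        # (temperature) shrinks over episodes, moving the policy from
        # exploration-dominant toward exploitation-dominant behavior.
        tau = tau0 / math.log(t + 1.0)
        # Gibbs/soft-max distribution over the empirical mean rewards;
        # subtracting the maximum keeps the exponentials stable.
        m_max = max(means)
        weights = [math.exp((m - m_max) / tau) for m in means]
        arm = random.choices(range(n_arms), weights=weights)[0]
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
        total_reward += reward
    return total_reward, means
```

For example, `annealed_softmax_bandit(lambda a: random.gauss(0.5 + 0.1 * a, 1.0), n_arms=3, horizon=10000)` would simulate a small three-armed instance with Gaussian rewards. Whether a given schedule cools "sufficiently fast" in the paper's sense is precisely what its regret analysis establishes; the logarithmic decay above is only a common default.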