Near-Bayesian exploration in polynomial time

Kolter, J. Zico; Ng, Andrew Y.

doi:10.1145/1553374.1553441

Cited by 139 publications

(157 citation statements)

References 9 publications

Supporting

Mentioning

155

Contrasting


“…In general, computing the Bayes optimal policy is often intractable, making approximation inevitable (see, e.g., Duff (2002)). It remains an active research area to develop efficient algorithms to approximate the Bayes optimal policy (Poupart et al, 2006;Kolter and Ng, 2009).…”

Section: Bayesian Framework (mentioning)

confidence: 99%

“…For instance, the analytic tools developed for PAC-MDP algorithms are used to derive the BEB algorithm that approximates the Bayes optimal policy except for polynomially many steps (Kolter and Ng, 2009). As another example, the notion of known state-actions is combined with the posterior distribution of models to yield a randomized PAC-MDP algorithm (BOSS) that is able to use prior knowledge about MDP models (Asmuth et al, 2009).…”

Section: Bayesian Framework (mentioning)

confidence: 99%


Sample Complexity Bounds of Exploration

2012

Adaptation, Learning, and Optimization


Efficient exploration is widely recognized as a fundamental challenge inherent in reinforcement learning. Algorithms that explore efficiently converge faster to near-optimal policies. While heuristic techniques are popular in practice, they lack formal guarantees and may not work well in general. This chapter studies algorithms with polynomial sample complexity of exploration, both model-based and model-free ones, in a unified manner. These so-called PAC-MDP algorithms behave near-optimally except in a "small" number of steps with high probability. A new learning model known as KWIK is used to unify most existing model-based PAC-MDP algorithms for various subclasses of Markov decision processes. We also compare the sample-complexity framework to alternatives for formalizing exploration efficiency such as regret minimization and Bayes optimal solutions.


“…: a uniform distribution consisting to assume each transition has been observed once). Bayesian Exploration Bonus (BEB) (Kolter and Ng, 2009a) builds the expected MDP given the current history at each timestep. The reward function of this MDP is slightly modified to give an exploration bonus to transitions which have been observed less frequently.…”

Section: State-of-the-art (mentioning)

confidence: 99%
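The BEB idea quoted above (plan in the expected MDP, with a reward bonus that shrinks as a state-action pair is observed more often) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `beb_value_iteration`, a Dirichlet count array of shape (S, A, S), known rewards, a bonus of the form beta / (1 + n(s, a)), and undiscounted finite-horizon value iteration with the counts held fixed during planning. It is not the authors' implementation.

```python
import numpy as np

def beb_value_iteration(counts, rewards, beta=2.0, horizon=50):
    """Plan in the mean MDP with a count-based exploration bonus (illustrative).

    counts  : (S, A, S) array of Dirichlet pseudo-counts (prior + observations)
    rewards : (S, A) array of known rewards (an assumption of this sketch)
    """
    n_sa = counts.sum(axis=2)              # total pseudo-count per (s, a)
    p_mean = counts / n_sa[:, :, None]     # expected transition probabilities
    bonus = beta / (1.0 + n_sa)            # smaller bonus for well-observed pairs
    v = np.zeros(counts.shape[0])
    for _ in range(horizon):
        q = rewards + bonus + p_mean @ v   # (S, A) action values
        v = q.max(axis=1)
    return q.argmax(axis=1), v

# Usage: state 0 / action 0 has been tried 100 times, everything else never.
counts = np.ones((2, 2, 2))
counts[0, 0, :] += 50
policy, values = beb_value_iteration(counts, np.zeros((2, 2)), horizon=1)
```

With zero rewards, the greedy policy in state 0 prefers the rarely tried action 1, because its bonus (2/3) exceeds that of the well-observed action 0 (2/103).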

“…First, sampling possible transition probabilities, based on past observations, relies on the computation of P(f | h_t) ∝ P(h_t | f) P(f), which is intractable for most probabilistic models (Duff, 2002;Kaelbling et al, 1998;Kolter and Ng, 2009b). Second, the BAMDP state space is actually made of all possible histories and is infinite.…”

Section: Solving Bamdp (mentioning)

confidence: 99%
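For contrast with the general case described in the statement above, the posterior has a closed form in one special case: independent Dirichlet priors over each transition distribution, where Bayes' rule reduces to adding observed counts to prior counts. The sketch below assumes that conjugate setting; the function name `dirichlet_posterior` and the (state, action, next_state) history format are hypothetical choices for illustration.

```python
import numpy as np

def dirichlet_posterior(prior_counts, history):
    """Return posterior Dirichlet counts after observing a transition history.

    prior_counts : (S, A, S) array of prior pseudo-counts
    history      : iterable of (state, action, next_state) index triples
    """
    counts = np.array(prior_counts, dtype=float)   # copy, leave the prior intact
    for s, a, s_next in history:
        counts[s, a, s_next] += 1.0                # conjugate update: add one count
    return counts

# Usage: uniform prior (each transition counted once), two observed transitions.
prior = np.ones((2, 1, 2))
posterior = dirichlet_posterior(prior, [(0, 0, 1), (0, 0, 1)])
mean_model = posterior / posterior.sum(axis=2, keepdims=True)
```

Here the posterior mean for state 0, action 0 is [0.25, 0.75]. The unvisited state 1 keeps its uniform prior mean.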

“…They can be divided in two main classes: online methods, and offline methods. The former group (Fonteneau et al, 2013;Asmuth and Littman, 2011;Walsh et al, 2010;Kolter and Ng, 2009a) relies on sparse sampling of possible models based on the current observations, to reduce the number of transition probabilities computations. The latter group (Wang et al, 2012) uses the prior knowledge to train an agent able to act on all possible sequences of observations.…”

Section: Solving Bamdp (mentioning)

confidence: 99%


Approximate Bayes Optimal Policy Search using Neural Networks

Castronovo, François-Lavet, Fonteneau, et al. 2017

Proceedings of the 9th International Conference on Agents and Artificial Intelligence


Abstract: Bayesian Reinforcement Learning (BRL) agents aim to maximise the expected collected rewards obtained when interacting with an unknown Markov Decision Process (MDP) while using some prior knowledge. State-of-the-art BRL agents rely on frequent updates of the belief on the MDP, as new observations of the environment are made. This offers theoretical guarantees to converge to an optimum, but is computationally intractable, even on small-scale problems. In this paper, we present a method that circumvents this issue by training a parametric policy able to recommend an action directly from raw observations. Artificial Neural Networks (ANNs) are used to represent this policy, and are trained on the trajectories sampled from the prior. The trained model is then used online, and is able to act on the real MDP at a very low computational cost. Our new algorithm shows strong empirical performance, on a wide range of test problems, and is robust to inaccuracies of the prior distribution.


Exploration Methods in Sparse Reward Environments

Hensel

2021

Reinforcement Learning Algorithms: Analysis and Applications

