Proceedings of the 23rd International Conference on Machine Learning (ICML '06), 2006
DOI: 10.1145/1143844.1143932
An analytic solution to discrete Bayesian reinforcement learning

Abstract: Reinforcement learning (RL) was originally proposed as a framework to allow agents to learn in an online fashion as they interact with their environment. Existing RL algorithms come short of achieving this goal because the amount of exploration required is often too costly and/or too time consuming for online learning. As a result, RL is mostly used for offline learning in simulated environments. We propose a new algorithm, called BEETLE, for effective online learning that is computationally efficient while mi…
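The abstract is cut off above. For background on the setting it refers to, discrete Bayesian RL is usually formulated with Dirichlet priors over the unknown transition probabilities, so the agent's belief reduces to a table of transition counts. The sketch below (Python/NumPy) shows that standard belief representation and its conjugate update; it is a generic illustration under those assumptions, not the BEETLE algorithm itself, and the class and parameter names are invented for the example.

```python
import numpy as np

class DirichletBeliefMDP:
    """Belief over an unknown discrete MDP: one independent Dirichlet
    posterior over the next-state distribution of every (state, action)
    pair. Illustrative only: the standard conjugate setup for discrete
    Bayesian RL, not the BEETLE algorithm from the paper."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # counts[s, a, s'] are Dirichlet parameters; a symmetric
        # prior_count of 1.0 corresponds to a uniform (Laplace) prior.
        self.counts = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        # Conjugate posterior update after observing one transition.
        self.counts[s, a, s_next] += 1.0

    def expected_model(self):
        # Posterior-mean transition probabilities P(s' | s, a).
        return self.counts / self.counts.sum(axis=2, keepdims=True)

    def sample_model(self, rng=None):
        # Draw one plausible MDP from the posterior; used by
        # sampling-based approximations of Bayesian RL.
        rng = np.random.default_rng() if rng is None else rng
        return np.apply_along_axis(rng.dirichlet, 2, self.counts)
```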

Cited by 178 publications (155 citation statements: 2 supporting, 153 mentioning, 0 contrasting). References 13 publications.
“…1 shows per-step regret of the algorithms as a function of the number of states. As predicted by the theoretical bounds, the per-step regret ∆ of UCRL2 significantly increases as the number of states increases, whereas the average regret of our RLPA is essentially independent of the state-space size. Although UCWM has a lower regret than RLPA for a small number of states, it quickly loses its advantage as the number of states grows.…”
Section: Methods (supporting)
confidence: 58%
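As an aside not taken from the excerpt, the scaling it describes is what the standard UCRL2 bound of Jaksch et al. (2010) would predict. Up to logarithmic factors, for a diameter-D MDP with S states and A actions,

\[
\mathrm{Regret}(T) = \tilde{O}\!\left(D S \sqrt{A T}\right)
\quad\Longrightarrow\quad
\Delta = \frac{\mathrm{Regret}(T)}{T} = \tilde{O}\!\left(\frac{D S \sqrt{A}}{\sqrt{T}}\right),
\]

so at any fixed horizon T the per-step regret grows linearly with S, whereas a bound that does not depend on the size of the state space (as claimed for RLPA) becomes favorable as S grows.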
“…In general, computing the Bayes optimal policy is often intractable, making approximation inevitable (see, e.g., Duff (2002)). It remains an active research area to develop efficient algorithms to approximate the Bayes optimal policy (Poupart et al., 2006; Kolter and Ng, 2009).…
Section: Bayesian Framework (mentioning)
confidence: 99%
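To make the intractability concrete (a standard formulation, not quoted from the excerpt): the Bayes-optimal policy solves the Bellman equation of the belief-augmented MDP,

\[
V^*(s, b) = \max_{a} \Big[ r(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a, b)\, V^*\!\big(s', b_{s a s'}\big) \Big],
\]

where b is the posterior over the unknown dynamics and b_{sas'} is the posterior after additionally observing the transition (s, a, s'). Because the set of reachable posteriors grows exponentially with the planning horizon, solving this equation exactly is generally infeasible, which is what motivates the approximations cited above.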
“…This augmented state representation results in an enormous state space, making the full Bayesian algorithm intractable. Attempts have been made to approximate the full algorithm by parameterizing the model and tying model parameters together [7] or by sampling from the model distribution [8], [9], but these methods have still only been tested in domains with 5-36 states. In addition to requiring a large amount of time to compute a policy, these methods must maintain a belief state over the model and require the user to create a well-defined model prior.…
Section: Related Work (mentioning)
confidence: 99%
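As an illustration of the "sampling from the model distribution" family of approximations mentioned above (often called posterior sampling, or Thompson sampling for RL), the sketch below reuses the Dirichlet belief class from the earlier snippet. The reward table R, the env_step callback, and the value-iteration helper are assumptions made for this example, not details taken from the cited papers.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """Plan in a *known* MDP: P[s, a, s'] transition probabilities and
    R[s, a] expected rewards. Returns a greedy policy (one action per state)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)      # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def posterior_sampling_episode(belief, R, env_step, s0, horizon, rng):
    """One episode of posterior sampling: draw a model from the belief,
    act greedily as if it were the true MDP, and update the belief with
    the transitions that actually occur in the real environment."""
    P_sampled = belief.sample_model(rng)     # one plausible dynamics model
    policy = value_iteration(P_sampled, R)   # plan against the sampled model
    s = s0
    for _ in range(horizon):
        a = policy[s]
        s_next = env_step(s, a)              # real-environment transition
        belief.update(s, a, s_next)          # conjugate posterior update
        s = s_next
    return belief
```

Resampling a fresh model each episode is what drives exploration in this family: models that look promising under the current posterior get tried, and the posterior update then corrects them.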