“…An online learning algorithm is called PAC-MDP if this measure can be bounded with high probability as a polynomial function of the natural parameters of the MDP and if in each time step polynomially many computational steps are performed. Algorithms that are known to be PAC-MDP include Rmax (Brafman and Tennenholtz, 2002; Kakade, 2003), MBIE (Strehl and Littman, 2005), Delayed Q-learning (Strehl et al., 2006), the optimistic-initialization-based algorithm of Szita and Lőrincz (2008), and MorMax by Szita and Szepesvári (2010). Of these, MorMax enjoys the best bound on the number of ε-suboptimal steps, T_ε.…”
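For reference, the sample-complexity measure referred to in this excerpt is usually formalized as follows (a standard statement in the PAC-MDP literature, e.g. Kakade, 2003; the exact parameterization and constants vary from paper to paper):

```latex
T_{\varepsilon} \;=\; \bigl|\,\{\, t \;:\; V^{\pi_t}(s_t) \;<\; V^{*}(s_t) - \varepsilon \,\}\,\bigr|,
\qquad
\Pr\!\Bigl[\, T_{\varepsilon} \;\le\; \operatorname{poly}\!\bigl(|S|,\,|A|,\,\tfrac{1}{\varepsilon},\,\tfrac{1}{\delta},\,\tfrac{1}{1-\gamma}\bigr) \Bigr] \;\ge\; 1-\delta ,
```

with the additional requirement that the computation performed in each time step is also polynomial in these quantities.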
“…Others have used confidence intervals on the MDP parameters themselves in a similar way in the discrete-state case [38]. Previous work has done similar exploration using GPs in supervised learning [22] and the bandit setting with continuous actions [9], but the latter is only for single-state RL whereas we explore in a full MDP.…”
Section: Related Work (mentioning, confidence: 99%)
“…Here we use a simpler strategy based on the "optimism in the face of uncertainty" principle, which has been a cornerstone of efficient exploration algorithms (e.g. [38]). …”
Section: Optimistic Exploration for GPQ (mentioning, confidence: 99%)
“…The first change uses the upper tail of the next state's Q-value in the Bellman update to maintain optimism of the value function and is reminiscent of the backups performed in Model-Based Interval Estimation [38]. The second change makes the algorithm select greedy actions with respect to an optimistic Q-function.…”
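To make the two changes described in this excerpt concrete, here is a minimal sketch of an optimistic GP-based Q update, using scikit-learn's GaussianProcessRegressor as a stand-in for the value model. The kappa bonus weight, the feature encoding, and the helper names are illustrative assumptions, not the cited algorithm's actual implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gamma = 0.99      # discount factor
kappa = 2.0       # width of the optimism bonus (assumed, not from the paper)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)

def optimistic_q(gp, state, actions):
    """Upper tail of the GP posterior: mean + kappa * std for each action."""
    X = np.array([np.concatenate([state, a]) for a in actions])
    mean, std = gp.predict(X, return_std=True)
    return mean + kappa * std

def bellman_target(gp, reward, next_state, actions):
    """Optimistic Bellman backup: r + gamma * max_a UCB(s', a)."""
    return reward + gamma * np.max(optimistic_q(gp, next_state, actions))

def greedy_action(gp, state, actions):
    """Second change: act greedily with respect to the optimistic Q-function."""
    return actions[int(np.argmax(optimistic_q(gp, state, actions)))]
```

Using the posterior upper tail both in the backup and in action selection keeps the agent's value estimates optimistic in regions the GP has not yet covered, which is what drives exploration here.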
An off-policy Bayesian nonparametric approximate reinforcement learning framework, termed GPQ, that employs a Gaussian process (GP) model of the value (Q) function is presented in both the batch and online settings. Sufficient conditions on GP hyperparameter selection are established to guarantee convergence of off-policy GPQ in the batch setting, and theoretical and practical extensions are provided for the online case. Empirical results demonstrate that GPQ has competitive learning speed in addition to its convergence guarantees and its ability to automatically choose its own basis locations.
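As a rough illustration of the online setting described in this abstract, the sketch below maintains a GP over state–action pairs and refits it on one-step Bellman targets. The feature encoding, kernel choice, and refit-every-step loop are assumptions made for illustration; the paper's actual procedure additionally constrains the GP hyperparameters to guarantee convergence.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gamma = 0.99                      # discount factor
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
X, y = [], []                     # (state, action) inputs and Bellman targets

def q_values(state, actions):
    """Mean GP prediction of Q(state, a) for each candidate action."""
    if not X:                     # no data yet: fall back to zeros
        return np.zeros(len(actions))
    Z = np.array([np.concatenate([state, a]) for a in actions])
    return gp.predict(Z)

def gpq_online_step(state, action, reward, next_state, actions):
    """One online GPQ-style update: add a Bellman-target sample and refit the GP."""
    target = reward + gamma * np.max(q_values(next_state, actions))
    X.append(np.concatenate([state, action]))
    y.append(target)
    gp.fit(np.array(X), np.array(y))
```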
“…If a complete model of the environment is available, dynamic programming [10] can be used to compute an optimal value function, from which an optimal policy can be derived. If a model is not available, one can be learned from experience [26,44,65,68]. Alternatively, an optimal value function can be discovered via model-free techniques such as temporal difference (TD) methods [67], which combine elements of dynamic programming with Monte Carlo estimation [5].…”
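For readers less familiar with the model-free route mentioned in this excerpt, a minimal tabular temporal-difference (Q-learning) update looks like the following; the dictionary-based layout and the epsilon-greedy behaviour policy are just an illustrative sketch, not taken from the cited works.

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.99, 0.1     # step size, discount, exploration rate
Q = defaultdict(float)                     # Q[(state, action)], implicitly 0.0

def td_update(state, action, reward, next_state, actions):
    """One temporal-difference backup: no model of the MDP is required."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error

def epsilon_greedy(state, actions):
    """Behaviour policy: mostly greedy, occasionally random, to keep exploring."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(state, a)])
```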
Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address this shortcoming by presenting results of empirical comparisons between Sarsa and NEAT, two representative methods, in mountain car and keepaway, two benchmark reinforcement learning tasks. In each task, the methods are evaluated in combination with both linear and nonlinear representations to determine their best configurations. In addition, this article tests two specific hypotheses about the critical factors contributing to these methods' relative performance: (1) that sensor noise reduces the final performance of Sarsa more than that of NEAT, because Sarsa's learning updates are not reliable in the absence of the Markov property and (2) that stochasticity, by introducing noise in fitness estimates, reduces the learning speed of NEAT more than that of Sarsa. Experiments in variations of mountain car and keepaway designed to isolate these factors confirm both these hypotheses.