Proceedings of the 22nd International Conference on Machine Learning (ICML '05), 2005
DOI: 10.1145/1102351.1102459
A theoretical analysis of Model-Based Interval Estimation

Cited by 85 publications (79 citation statements). References 8 publications.
“…An online learning algorithm is called PAC-MDP if this measure can be bounded with high probability as a polynomial function of the natural parameters of the MDP and if in each time step polynomially many computational steps are performed. Algorithms that are known to be PAC-MDP include Rmax (Brafman and Tennenholtz, 2002; Kakade, 2003), MBIE (Strehl and Littman, 2005), Delayed Q-learning (Strehl et al., 2006), the optimistic-initialization-based algorithm of Szita and Lőrincz (2008), and MorMax by Szita and Szepesvári (2010). Of these, MorMax enjoys the best bound for the number of ε-suboptimal steps, T_ε.…”
Section: PAC-MDP Algorithms (mentioning)
confidence: 99%
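
As a gloss on the PAC-MDP criterion quoted above, the sample-complexity condition is commonly written along the following lines (a hedged paraphrase in standard notation; the exact set of parameters varies across the cited papers):

% With probability at least 1 − δ, the number of time steps at which the
% algorithm's current policy A_t is more than ε worse than optimal is bounded
% by a polynomial in the natural parameters of the MDP.
\Pr\left[\, \big|\{\, t : V^{A_t}(s_t) < V^{*}(s_t) - \epsilon \,\}\big|
    \;\le\; \mathrm{poly}\!\left(|S|,\, |A|,\, \tfrac{1}{\epsilon},\, \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma}\right) \right] \;\ge\; 1 - \delta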
“…Others have used confidence intervals on the MDP parameters themselves in a similar way in the discrete-state case [38]. Previous work has done similar exploration using GPs in supervised learning [22] and the bandit setting with continuous actions [9], but the latter is only for single-state RL whereas we explore in a full MDP.…”
Section: Related Work (mentioning)
confidence: 99%
“…Here we use a simpler strategy based on the "optimism in the face of uncertainty" principle, which has been a cornerstone of efficient exploration algorithms (e.g. [38]). …”
Section: Optimistic Exploration for GPQ (mentioning)
confidence: 99%
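
Since the quoted passages point to MBIE-style confidence intervals as the concrete realization of the optimism principle, the following is a minimal Python sketch of optimistic value iteration over an L1 confidence set around the empirical transition model. Function names, the confidence-radius constant, and hyperparameters are illustrative assumptions, not the cited paper's exact algorithm.

# Minimal sketch of optimism-in-the-face-of-uncertainty exploration in the
# style of MBIE (Strehl & Littman, 2005).  The confidence radius and all
# constants are illustrative; reward optimism is omitted for brevity.
import numpy as np

def optimistic_values(counts, rewards, gamma=0.95, delta=0.05, iters=200):
    """counts[s, a, s'] = visit counts; rewards[s, a] = empirical mean reward.

    Returns state-action values that are optimistic with respect to an L1
    confidence set around the empirical transition distribution of each (s, a).
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                          # visits to each (s, a)
    p_hat = counts / np.maximum(n[..., None], 1)    # empirical transition model
    # L1 confidence radius; unvisited pairs get the widest possible interval (2).
    eps = np.where(n > 0,
                   np.sqrt(2 * (S * np.log(2) - np.log(delta)) / np.maximum(n, 1)),
                   2.0)

    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        order = np.argsort(-V)                      # states sorted by value, best first
        for s in range(S):
            for a in range(A):
                # Shift up to eps/2 probability mass onto the highest-value next
                # state, removing the same amount from the lowest-value states.
                p = p_hat[s, a].copy()
                move = min(eps[s, a] / 2, 1 - p[order[0]])
                p[order[0]] += move
                for sp in order[::-1]:              # take mass from worst states first
                    if move <= 0:
                        break
                    take = min(move, p[sp]) if sp != order[0] else 0.0
                    p[sp] -= take
                    move -= take
                Q[s, a] = rewards[s, a] + gamma * p @ V
    return Q

Acting greedily with respect to the returned Q drives the agent toward state-action pairs whose confidence intervals are still wide, which is the exploration behaviour the excerpt describes.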
“…If a complete model of the environment is available, dynamic programming [10] can be used to compute an optimal value function, from which an optimal policy can be derived. If a model is not available, one can be learned from experience [26,44,65,68]. Alternatively, an optimal value function can be discovered via model-free techniques such as temporal difference (TD) methods [67], which combine elements of dynamic programming with Monte Carlo estimation [5].…”
Section: Introduction (mentioning)
confidence: 99%
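
To make the distinction drawn in this excerpt concrete, here is a minimal sketch contrasting the two routes: dynamic-programming value iteration when a complete model (P, R) is available, and a model-free temporal-difference method (tabular Q-learning) that learns from sampled transitions only. The array shapes, the env_step interface, and all hyperparameters are illustrative assumptions.

# Model-based dynamic programming versus model-free TD learning, in tabular form.
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P[s, a, s'] = transition probabilities, R[s, a] = expected rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V                       # Bellman optimality backup for all (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)          # optimal values and a greedy policy
        V = V_new

def td_q_learning(env_step, S, A, episodes=500, alpha=0.1, gamma=0.95, eps=0.1, rng=None):
    """env_step(s, a) -> (reward, next_state, done); no model of P or R is needed."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((S, A))
    for _ in range(episodes):
        s, done = 0, False                          # illustrative fixed start state
        while not done:
            a = rng.integers(A) if rng.random() < eps else int(Q[s].argmax())
            r, s_next, done = env_step(s, a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])   # TD update toward the bootstrapped target
            s = s_next
    return Q

Value iteration converges because each backup is a contraction in the max norm; tabular Q-learning reaches the same optimal fixed point under standard step-size and exploration conditions, which is the sense in which TD methods combine dynamic programming with sampled (Monte Carlo style) experience.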