“…An online learning algorithm is called PAC-MDP if this measure can be bounded with high probability as a polynomial function of the natural parameters of the MDP and if in each time step polynomially many computational steps are performed. Algorithms that are known to be PAC-MDP include Rmax (Brafman and Tennenholtz, 2002; Kakade, 2003), MBIE (Strehl and Littman, 2005), Delayed Q-learning (Strehl et al., 2006), the optimistic-initialization-based algorithm of Szita and Lőrincz (2008), and MorMax by Szita and Szepesvári (2010). Of these, MorMax enjoys the best bound on the number of ε-suboptimal steps, T_ε.…”
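For reference, the sample-complexity measure referred to in this excerpt is usually formalized as follows (a standard statement in the PAC-MDP literature, e.g. Kakade, 2003; the exact parameterization and constants vary from paper to paper):

```latex
T_{\varepsilon} \;=\; \bigl|\,\{\, t \;:\; V^{\pi_t}(s_t) \;<\; V^{*}(s_t) - \varepsilon \,\}\,\bigr|,
\qquad
\Pr\!\Bigl[\, T_{\varepsilon} \;\le\; \operatorname{poly}\!\bigl(|S|,\,|A|,\,\tfrac{1}{\varepsilon},\,\tfrac{1}{\delta},\,\tfrac{1}{1-\gamma}\bigr) \Bigr] \;\ge\; 1-\delta ,
```

with the additional requirement that the computation performed in each time step is also polynomial in these quantities.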
“…Others have used confidence intervals on the MDP parameters themselves in a similar way in the discrete-state case [38]. Previous work has done similar exploration using GPs in supervised learning [22] and the bandit setting with continuous actions [9], but the latter is only for single-state RL whereas we explore in a full MDP.…”
Section: Related Work (mentioning, confidence: 99%)
“…Here we use a simpler strategy based on the "optimism in the face of uncertainty" principle, which has been a cornerstone of efficient exploration algorithms (e.g. [38]). …”
Section: Optimistic Exploration for GPQ (mentioning, confidence: 99%)
“…The first change uses the upper tail of the next state's Q-value in the Bellman update to maintain optimism of the value function and is reminiscent of the backups performed in Model-Based Interval Estimation [38]. The second change makes the algorithm select greedy actions with respect to an optimistic Q-function.…”
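To make the two changes described in this excerpt concrete, here is a minimal sketch of an optimistic GP-based Q update, using scikit-learn's GaussianProcessRegressor as a stand-in for the value model. The kappa bonus weight, the feature encoding, and the helper names are illustrative assumptions, not the cited algorithm's actual implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gamma = 0.99      # discount factor
kappa = 2.0       # width of the optimism bonus (assumed, not from the paper)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)

def optimistic_q(gp, state, actions):
    """Upper tail of the GP posterior: mean + kappa * std for each action."""
    X = np.array([np.concatenate([state, a]) for a in actions])
    mean, std = gp.predict(X, return_std=True)
    return mean + kappa * std

def bellman_target(gp, reward, next_state, actions):
    """Optimistic Bellman backup: r + gamma * max_a UCB(s', a)."""
    return reward + gamma * np.max(optimistic_q(gp, next_state, actions))

def greedy_action(gp, state, actions):
    """Second change: act greedily with respect to the optimistic Q-function."""
    return actions[int(np.argmax(optimistic_q(gp, state, actions)))]
```

Using the posterior upper tail both in the backup and in action selection keeps the agent's value estimates optimistic in regions the GP has not yet covered, which is what drives exploration here.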
An off-policy Bayesian nonparametric approximate reinforcement learning framework, termed GPQ, that employs a Gaussian process (GP) model of the value (Q) function is presented in both the batch and online settings. Sufficient conditions on GP hyperparameter selection are established to guarantee convergence of off-policy GPQ in the batch setting, and theoretical and practical extensions are provided for the online case. Empirical results demonstrate that GPQ has competitive learning speed in addition to its convergence guarantees and its ability to automatically choose its own basis locations.
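As a rough illustration of the online setting described in this abstract, the sketch below maintains a GP over state–action pairs and refits it on one-step Bellman targets. The feature encoding, kernel choice, and refit-every-step loop are assumptions made for illustration; the paper's actual procedure additionally constrains the GP hyperparameters to guarantee convergence.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gamma = 0.99                      # discount factor
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
X, y = [], []                     # (state, action) inputs and Bellman targets

def q_values(state, actions):
    """Mean GP prediction of Q(state, a) for each candidate action."""
    if not X:                     # no data yet: fall back to zeros
        return np.zeros(len(actions))
    Z = np.array([np.concatenate([state, a]) for a in actions])
    return gp.predict(Z)

def gpq_online_step(state, action, reward, next_state, actions):
    """One online GPQ-style update: add a Bellman-target sample and refit the GP."""
    target = reward + gamma * np.max(q_values(next_state, actions))
    X.append(np.concatenate([state, action]))
    y.append(target)
    gp.fit(np.array(X), np.array(y))
```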
“…If a complete model of the environment is available, dynamic programming [10] can be used to compute an optimal value function, from which an optimal policy can be derived. If a model is not available, one can be learned from experience [26,44,65,68]. Alternatively, an optimal value function can be discovered via model-free techniques such as temporal difference (TD) methods [67], which combine elements of dynamic programming with Monte Carlo estimation [5].…”
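For readers less familiar with the model-free route mentioned in this excerpt, a minimal tabular temporal-difference (Q-learning) update looks like the following; the dictionary-based layout and the epsilon-greedy behaviour policy are just an illustrative sketch, not taken from the cited works.

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.99, 0.1     # step size, discount, exploration rate
Q = defaultdict(float)                     # Q[(state, action)], implicitly 0.0

def td_update(state, action, reward, next_state, actions):
    """One temporal-difference backup: no model of the MDP is required."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error

def epsilon_greedy(state, actions):
    """Behaviour policy: mostly greedy, occasionally random, to keep exploring."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(state, a)])
```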
Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address this shortcoming by presenting results of empirical comparisons between Sarsa and NEAT, two representative methods, in mountain car and keepaway, two benchmark reinforcement learning tasks. In each task, the methods are evaluated in combination with both linear and nonlinear representations to determine their best configurations. In addition, this article tests two specific hypotheses about the critical factors contributing to these methods' relative performance: (1) that sensor noise reduces the final performance of Sarsa more than that of NEAT, because Sarsa's learning updates are not reliable in the absence of the Markov property and (2) that stochasticity, by introducing noise in fitness estimates, reduces the learning speed of NEAT more than that of Sarsa. Experiments in variations of mountain car and keepaway designed to isolate these factors confirm both these hypotheses.