2007
DOI: 10.1007/s10994-007-5038-2
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Abstract: We consider the problem of finding a near-optimal policy using value-function methods in continuous space, discounted Markovian Decision Problems (MDP) when only a single trajectory …

Full citation: Andras Antos, Csaba Szepesvari, Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, Springer, 2008, 71:89–129.
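For orientation (a sketch reconstructed from the title and the citing statements below, not text quoted from the abstract): the ordinary Bellman residual for evaluating a fixed policy π cannot be estimated without bias from single transitions of one trajectory, and the "modified" Bellman-residual loss that citing papers attribute to this work is, roughly, of the following form; the notation (function class F, weighting ν, operator T^π) is an assumption here.

```latex
% Sketch only; F, nu and T^pi are assumed notation, not quoted from the abstract.
% Ordinary Bellman residual for a fixed policy pi:  L(Q) = || Q - T^pi Q ||_nu^2.
% The modified loss subtracts the best in-class fit to T^pi Q, which removes the
% variance bias incurred when L is estimated from single transitions:
\[
  L_{\mathrm{mod}}(Q) \;=\; \sup_{h \in \mathcal{F}}
  \Big\{ \mathbb{E}\!\left[\big(Q(X,A) - R - \gamma\, Q(X',\pi(X'))\big)^{2}\right]
       - \mathbb{E}\!\left[\big(h(X,A) - R - \gamma\, Q(X',\pi(X'))\big)^{2}\right] \Big\},
\]
% with (X, A, R, X') consecutive elements of the single sample path.
```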

Cited by 255 publications (305 citation statements)
References 27 publications
“…While our current regret upper bounds seem to be sub-optimal in terms of H (we are not aware of any tight lower bound), in the future we plan to deploy the analysis in [9,38] and develop tighter regret upper bounds as well as an information-theoretic lower bound. We also plan to extend the analysis in Abeille and Lazaric [4], develop Thompson sampling methods with a performance guarantee, and finally go beyond linear models [34].…”
Section: Results (mentioning)
confidence: 99%
“…PolicyEval can be any algorithm that computes an estimate Q̂^π of Q^π, including: rollout-based estimation [44,45], LSTD-Q [43,27], modified Bellman Residual Minimization [3], and Fitted Q-Iteration [21,52,26], or a combination of rollouts and function approximation [33], as well as online algorithms such as TD [62] and GTD [57].…”
Section: CAPI Framework (mentioning)
confidence: 99%
“…To address the limitation of rollout-based estimators, we propose the CAPI framework. CAPI generalizes the current classification-based algorithms by allowing the use of any policy evaluation method, including, but not limited to, rollout-based estimators (as in previous work [44,45]), LSTD [43,46], modified Bellman Residual Minimization [3], the policy evaluation version of Fitted Q-Iteration [21,52,15,24], and their regularized variants [27,37,26], as well as online methods for policy evaluation such as Temporal Difference learning [58,62] and GTD [57]. This is a significant generalization of existing classification-based RL algorithms, which become special cases of CAPI.…”
Section: Introduction (mentioning)
confidence: 99%
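The two CAPI statements above describe a plug-in structure: any estimator that returns Q̂^π can serve as the policy-evaluation step of an approximate policy iteration loop. Below is a minimal Python sketch of that structure, not the cited algorithm; all names (PolicyEval, td0_policy_eval, approximate_policy_iteration) and the tabular TD(0) evaluator are illustrative assumptions.

```python
# Minimal sketch of a plug-in policy-evaluation interface, assuming a finite
# state/action encoding. Names here are illustrative, not from the cited work.
from typing import Callable, Dict, List, Tuple

State = int
Action = int
Transition = Tuple[State, Action, float, State]      # (s, a, reward, next_s)
Policy = Callable[[State], Action]
QFunction = Callable[[State, Action], float]
PolicyEval = Callable[[List[Transition], Policy], QFunction]


def td0_policy_eval(data: List[Transition], policy: Policy,
                    gamma: float = 0.99, lr: float = 0.1,
                    sweeps: int = 50) -> QFunction:
    """One possible PolicyEval: tabular TD(0)-style evaluation on a fixed batch."""
    q: Dict[Tuple[State, Action], float] = {}
    for _ in range(sweeps):
        for s, a, r, s2 in data:
            target = r + gamma * q.get((s2, policy(s2)), 0.0)
            q[(s, a)] = q.get((s, a), 0.0) + lr * (target - q.get((s, a), 0.0))
    return lambda s, a: q.get((s, a), 0.0)


def greedy(q_hat: QFunction, actions: List[Action]) -> Policy:
    """Policy improvement: act greedily with respect to the current estimate."""
    return lambda s: max(actions, key=lambda a: q_hat(s, a))


def approximate_policy_iteration(data: List[Transition], actions: List[Action],
                                 policy_eval: PolicyEval, init_policy: Policy,
                                 iterations: int = 10) -> Policy:
    """Generic loop: any PolicyEval (rollouts, LSTD-Q, BRM, FQI, ...) plugs in."""
    policy = init_policy
    for _ in range(iterations):
        q_hat = policy_eval(data, policy)      # policy evaluation (plug-in step)
        policy = greedy(q_hat, actions)        # policy improvement
    return policy
```

Usage would simply pass a list of (s, a, r, s') transitions, the action set, any estimator with the PolicyEval signature (td0_policy_eval above, or a Bellman-residual or fitted-Q variant), and an initial policy.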
“…Although TD(λ) is one of the most celebrated ideas in reinforcement learning, when it is combined with linear function approximation (named linear-TD(λ)) and off-policy learning, stability of the solution is not guaranteed. Nevertheless, if linear-TD(0) converges, it is shown in [22] that the solution satisfies V_θ = Π T V_θ, where T is a contraction mapping (the Bellman operator), and Π is a projection operator that takes any value function and projects it to the nearest value function representable by the function approximation.…”
Section: Linear Approximation of Value Functions (mentioning)
confidence: 99%
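As an aside, the fixed point described in this last statement is commonly written in matrix form; the feature matrix Φ and the stationary-distribution weight matrix D below are standard notation, not taken from the quoted text.

```latex
% Projected Bellman fixed point for linear-TD(0): V_theta = Phi theta,
% Pi is the D-weighted projection onto span(Phi), T the Bellman operator.
\[
  V_\theta \;=\; \Pi\, T\, V_\theta,
  \qquad
  \Pi \;=\; \Phi\,(\Phi^\top D\,\Phi)^{-1}\Phi^\top D,
\]
% equivalently, theta solves the linear system  Phi^T D (T V_theta - V_theta) = 0.
```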