2007
DOI: 10.1007/s10994-007-5038-2
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Abstract: We consider the problem of finding a near-optimal policy using value-function methods in continuous space, discounted Markovian Decision Problems (MDP) when only a single trajectory …

Full citation: Andras Antos, Csaba Szepesvari, Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, Springer, 2008, 71:89–129.
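For orientation (a sketch reconstructed from the title and the citing statements below, not text quoted from the abstract): the ordinary Bellman residual for evaluating a fixed policy π cannot be estimated without bias from single transitions of one trajectory, and the "modified" Bellman-residual loss that citing papers attribute to this work is, roughly, of the following form; the notation (function class F, weighting ν, operator T^π) is an assumption here.

```latex
% Sketch only; F, nu and T^pi are assumed notation, not quoted from the abstract.
% Ordinary Bellman residual for a fixed policy pi:  L(Q) = || Q - T^pi Q ||_nu^2.
% The modified loss subtracts the best in-class fit to T^pi Q, which removes the
% variance bias incurred when L is estimated from single transitions:
\[
  L_{\mathrm{mod}}(Q) \;=\; \sup_{h \in \mathcal{F}}
  \Big\{ \mathbb{E}\!\left[\big(Q(X,A) - R - \gamma\, Q(X',\pi(X'))\big)^{2}\right]
       - \mathbb{E}\!\left[\big(h(X,A) - R - \gamma\, Q(X',\pi(X'))\big)^{2}\right] \Big\},
\]
% with (X, A, R, X') consecutive elements of the single sample path.
```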

Cited by 255 publications (305 citation statements)
References 27 publications
“…While our current regret upper bounds seem to be sub-optimal in terms of H (we are not aware of any tight lower bound), in the future we plan to deploy the analysis in [9,38] and develop tighter regret upper bounds as well as an information-theoretic lower bound. We also plan to extend the analysis in Abeille and Lazaric [4], develop Thompson sampling methods with a performance guarantee, and finally go beyond linear models [34].…”
Section: Results (mentioning)
confidence: 99%
“…PolicyEval can be any algorithm that computes an estimate Q̂^π of Q^π, including: rollout-based estimation [44,45], LSTD-Q [43,27], modified Bellman Residual Minimization [3], and Fitted Q-Iteration [21,52,26], or a combination of rollouts and function approximation [33], as well as online algorithms such as TD [62] and GTD [57].…”
Section: CAPI Framework (mentioning)
confidence: 99%
“…To address the limitation of rollout-based estimators, we propose the CAPI framework. CAPI generalizes the current classification-based algorithms by allowing the use of any policy evaluation method, including, but not limited to, rollout-based estimators (as in previous work [44,45]), LSTD [43,46], modified Bellman Residual Minimization [3], the policy evaluation version of Fitted Q-Iteration [21,52,15,24], and their regularized variants [27,37,26], as well as online methods for policy evaluation such as Temporal Difference learning [58,62] and GTD [57]. This is a significant generalization of existing classification-based RL algorithms, which become special cases of CAPI.…”
Section: Introduction (mentioning)
confidence: 99%
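The two CAPI statements above describe a plug-in structure: any estimator that returns Q̂^π can serve as the policy-evaluation step of an approximate policy iteration loop. Below is a minimal Python sketch of that structure, not the cited algorithm; all names (PolicyEval, td0_policy_eval, approximate_policy_iteration) and the tabular TD(0) evaluator are illustrative assumptions.

```python
# Minimal sketch of a plug-in policy-evaluation interface, assuming a finite
# state/action encoding. Names here are illustrative, not from the cited work.
from typing import Callable, Dict, List, Tuple

State = int
Action = int
Transition = Tuple[State, Action, float, State]      # (s, a, reward, next_s)
Policy = Callable[[State], Action]
QFunction = Callable[[State, Action], float]
PolicyEval = Callable[[List[Transition], Policy], QFunction]


def td0_policy_eval(data: List[Transition], policy: Policy,
                    gamma: float = 0.99, lr: float = 0.1,
                    sweeps: int = 50) -> QFunction:
    """One possible PolicyEval: tabular TD(0)-style evaluation on a fixed batch."""
    q: Dict[Tuple[State, Action], float] = {}
    for _ in range(sweeps):
        for s, a, r, s2 in data:
            target = r + gamma * q.get((s2, policy(s2)), 0.0)
            q[(s, a)] = q.get((s, a), 0.0) + lr * (target - q.get((s, a), 0.0))
    return lambda s, a: q.get((s, a), 0.0)


def greedy(q_hat: QFunction, actions: List[Action]) -> Policy:
    """Policy improvement: act greedily with respect to the current estimate."""
    return lambda s: max(actions, key=lambda a: q_hat(s, a))


def approximate_policy_iteration(data: List[Transition], actions: List[Action],
                                 policy_eval: PolicyEval, init_policy: Policy,
                                 iterations: int = 10) -> Policy:
    """Generic loop: any PolicyEval (rollouts, LSTD-Q, BRM, FQI, ...) plugs in."""
    policy = init_policy
    for _ in range(iterations):
        q_hat = policy_eval(data, policy)      # policy evaluation (plug-in step)
        policy = greedy(q_hat, actions)        # policy improvement
    return policy
```

Usage would simply pass a list of (s, a, r, s') transitions, the action set, any estimator with the PolicyEval signature (td0_policy_eval above, or a Bellman-residual or fitted-Q variant), and an initial policy.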
“…Although TD(λ) is one of the most celebrated ideas in reinforcement learning, when it is combined with linear function approximation (named linear-TD(λ)) and off-policy learning, stability of the solution is not guaranteed. Nevertheless, if linear-TD(0) converges, it is shown in [22] that the solution satisfies V_θ = Π T V_θ, where T is a contraction mapping (the Bellman operator), and Π is a projection operator that takes any value function and projects it to the nearest value function representable by the function approximation.…”
Section: Linear Approximation of Value Functions (mentioning)
confidence: 99%
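As an aside, the fixed point described in this last statement is commonly written in matrix form; the feature matrix Φ and the stationary-distribution weight matrix D below are standard notation, not taken from the quoted text.

```latex
% Projected Bellman fixed point for linear-TD(0): V_theta = Phi theta,
% Pi is the D-weighted projection onto span(Phi), T the Bellman operator.
\[
  V_\theta \;=\; \Pi\, T\, V_\theta,
  \qquad
  \Pi \;=\; \Phi\,(\Phi^\top D\,\Phi)^{-1}\Phi^\top D,
\]
% equivalently, theta solves the linear system  Phi^T D (T V_theta - V_theta) = 0.
```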