2020
DOI: 10.1609/aaai.v34i04.5784

Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Abstract: We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation …
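A minimal sketch of the tabular prediction variant the abstract describes (not the authors' implementation; the environment and policy interfaces below are assumed, with reset() returning a state index and step(a) returning (next_state, reward, done)). One value table is kept per horizon, and the horizon-h estimate bootstraps only from the horizon-(h−1) estimate:

import numpy as np

def fixed_horizon_td(env, policy, num_states, H, episodes=500, alpha=0.1, gamma=1.0):
    # V[h][s] estimates the (discounted) sum of the next h rewards from state s.
    # V[0] is identically zero, which grounds the recursion.
    V = np.zeros((H + 1, num_states))
    for _ in range(episodes):
        s = env.reset()                    # assumed interface: reset() -> state index
        done = False
        while not done:
            a = policy(s)                  # assumed interface: policy(state) -> action
            s_next, r, done = env.step(a)  # assumed interface: step(a) -> (state, reward, done)
            for h in range(1, H + 1):
                # The horizon-h target bootstraps from the horizon-(h-1) table,
                # so no estimate ever bootstraps from itself.
                target = r + gamma * (0.0 if done else V[h - 1][s_next])
                V[h][s] += alpha * (target - V[h][s])
            s = s_next
    return V

Note that the inner loop updates all H horizons from each transition, so per-step compute in this sketch scales linearly with the longest horizon.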

Cited by 12 publications (7 citation statements)
References 12 publications
“…Q-learning uses bootstrapping as in equation (2) to estimate a Q-function, that is, the estimate at one iteration is used to derive the update for the next iteration’s estimates. Alternatives to bootstrapping include fixed-horizon temporal difference methods (De Asis et al., 2019) and finite-horizon Monte Carlo updates, in which a Q-value is estimated based on observed returns from each state. While the resulting estimator for the Q-function has low bias, it comes with high variance.…”
Section: Related Work (mentioning)
confidence: 99%
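To make the contrast drawn in this statement concrete (the notation here is assumed, not quoted from the cited works): the one-step Q-learning target bootstraps from the same table it updates, a fixed-horizon target bootstraps from the next-shorter horizon, and a finite-horizon Monte Carlo target uses only observed rewards, trading the bootstrap's bias for higher variance.

\[
\begin{aligned}
Y_t^{\text{QL}} &= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') && \text{(bootstraps from } Q \text{ itself)} \\
Y_t^{\text{FH-TD},\,h} &= R_{t+1} + \gamma \max_{a'} Q_{h-1}(S_{t+1}, a') && \text{(bootstraps from the shorter horizon)} \\
Y_t^{\text{MC},\,h} &= \textstyle\sum_{i=1}^{h} \gamma^{\,i-1} R_{t+i} && \text{(no bootstrap: low bias, high variance)}
\end{aligned}
\]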
“…In addition, the experience replay option in some DRL methods, if used with care, considerably improves their performance (a good source on this subject, dealing with systems control, is reference [111]). DRL methods are not without issues (particularly related to convergence and sensitivity to the parameters involved) [20], but huge ongoing research efforts will certainly offer solutions to these issues and increase interest in the use of DRL (interesting new results, in this respect, were presented in [112] with consideration of the so-called "deadly triad" in RL: function approximation, bootstrapping, and off-policy learning).…”
Section: Perspectives (mentioning)
confidence: 99%
“…This is an actively pursued research area where a series of solutions have been proposed (Sutton et al. 2009; Maei 2011; van Hasselt, Mahmood, and Sutton 2014; Sutton, Mahmood, and White 2016), but these often suffer from either performing worse than off-policy TD when it does not diverge (Hackman 2013) or even from infinite variance (Sutton, Mahmood, and White 2016). Our approach is similar in spirit to that of De Asis et al. (2020), who estimate a new kind of return: fixed-horizon returns (i.e., the rewards from only the next k steps) instead of the typical discounted return.…”
Section: Introduction (mentioning)
confidence: 99%
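A brief note on why the fixed-horizon return sidesteps the divergence results cited in this statement (standard notation, assumed rather than taken from the cited papers): the horizon-h value function satisfies a recursion that terminates at horizon zero, so the quantity being estimated never appears in its own target.

\[
V_h^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma\, V_{h-1}^{\pi}(S_{t+1}) \,\middle|\, S_t = s \right],
\qquad V_0^{\pi}(s) \equiv 0 .
\]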