2020
DOI: 10.1609/aaai.v34i04.5784

Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Abstract: We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation …
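A minimal sketch of the tabular prediction variant the abstract describes (not the authors' implementation; the environment and policy interfaces below are assumed, with reset() returning a state index and step(a) returning (next_state, reward, done)). One value table is kept per horizon, and the horizon-h estimate bootstraps only from the horizon-(h−1) estimate:

import numpy as np

def fixed_horizon_td(env, policy, num_states, H, episodes=500, alpha=0.1, gamma=1.0):
    # V[h][s] estimates the (discounted) sum of the next h rewards from state s.
    # V[0] is identically zero, which grounds the recursion.
    V = np.zeros((H + 1, num_states))
    for _ in range(episodes):
        s = env.reset()                    # assumed interface: reset() -> state index
        done = False
        while not done:
            a = policy(s)                  # assumed interface: policy(state) -> action
            s_next, r, done = env.step(a)  # assumed interface: step(a) -> (state, reward, done)
            for h in range(1, H + 1):
                # The horizon-h target bootstraps from the horizon-(h-1) table,
                # so no estimate ever bootstraps from itself.
                target = r + gamma * (0.0 if done else V[h - 1][s_next])
                V[h][s] += alpha * (target - V[h][s])
            s = s_next
    return V

Note that the inner loop updates all H horizons from each transition, so per-step compute in this sketch scales linearly with the longest horizon.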

Cited by 12 publications (7 citation statements)
References 12 publications
“…Q-learning uses bootstrapping as in equation (2) to estimate a Q-function, that is, the estimate at one iteration is used to derive the update for the next iteration’s estimates. Alternatives to bootstrapping include fixed-horizon temporal difference methods (De Asis et al., 2019) and finite-horizon Monte Carlo updates, in which a Q-value is estimated based on observed returns from each state. While the resulting estimator for the Q-function has low bias, it comes with high variance.…”
Section: Related Work (mentioning)
confidence: 99%
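To make the contrast drawn in this statement concrete (the notation here is assumed, not quoted from the cited works): the one-step Q-learning target bootstraps from the same table it updates, a fixed-horizon target bootstraps from the next-shorter horizon, and a finite-horizon Monte Carlo target uses only observed rewards, trading the bootstrap's bias for higher variance.

\[
\begin{aligned}
Y_t^{\text{QL}} &= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') && \text{(bootstraps from } Q \text{ itself)} \\
Y_t^{\text{FH-TD},\,h} &= R_{t+1} + \gamma \max_{a'} Q_{h-1}(S_{t+1}, a') && \text{(bootstraps from the shorter horizon)} \\
Y_t^{\text{MC},\,h} &= \textstyle\sum_{i=1}^{h} \gamma^{\,i-1} R_{t+i} && \text{(no bootstrap: low bias, high variance)}
\end{aligned}
\]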
“…In addition, the experience replay option in some DRL methods, if used with care, considerably improves their performance (a good source on this subject, dealing with systems control, is reference [111]). DRL methods are not without issues (particularly related to convergence and sensitivity to the parameters involved) [20], but huge ongoing research efforts will certainly offer solutions to these issues and increase interest in the use of DRL (interesting new results, in this respect, were presented in [112] with consideration of the so-called "deadly triad" in RL: function approximation, bootstrapping, and off-policy learning).…”
Section: Perspectives (mentioning)
confidence: 99%
“…This is an actively pursued research area where a series of solutions have been proposed (Sutton et al. 2009; Maei 2011; van Hasselt, Mahmood, and Sutton 2014; Sutton, Mahmood, and White 2016), but these often suffer from either performing worse than off-policy TD when it does not diverge (Hackman 2013) or even from infinite variance (Sutton, Mahmood, and White 2016). Our approach is similar in spirit to that of De Asis et al. (2020), who estimate a new kind of return: fixed-horizon returns (i.e., the rewards from only the next k steps) instead of the typical discounted return.…”
Section: Introduction (mentioning)
confidence: 99%
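A brief note on why the fixed-horizon return sidesteps the divergence results cited in this statement (standard notation, assumed rather than taken from the cited papers): the horizon-h value function satisfies a recursion that terminates at horizon zero, so the quantity being estimated never appears in its own target.

\[
V_h^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma\, V_{h-1}^{\pi}(S_{t+1}) \,\middle|\, S_t = s \right],
\qquad V_0^{\pi}(s) \equiv 0 .
\]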