2018
DOI: 10.48550/arxiv.1812.02648
Preprint

Deep Reinforcement Learning and the Deadly Triad

Cited by 39 publications (60 citation statements). References: 0 publications.
“…Intuitively, if Q_2(·) (the target) is changing faster (dictated by Θ_2) than the actual value Q_1(·) (dictated by Θ_1), learning will not converge. This result supports the divergence claim of standard deep Q-learning [14].…”
Section: Figure (supporting)
confidence: 88%
“…Deep Q-learning in its pure form often shows divergent behavior with function approximation [12,14]. It has no known convergence guarantees, although convergence results have been obtained for some related algorithms [15].…”
Section: Introduction (mentioning)
confidence: 99%
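The divergence mentioned in the two excerpts above can be reproduced in a few lines. The following is a minimal illustrative sketch of my own (in the spirit of the classic two-state counterexample of Tsitsiklis and Van Roy, not code from any cited paper): a single weight w represents v(s1) = w and v(s2) = 2w, and repeatedly applying a semi-gradient TD(0) update on the s1 -> s2 transition alone makes w grow without bound whenever gamma > 0.5.

    # Illustrative only: two states whose values w and 2w share one weight.
    gamma, alpha = 0.99, 0.1
    w = 1.0
    for step in range(50):
        v_s1, v_s2 = 1.0 * w, 2.0 * w        # linear values, features 1 and 2
        delta = 0.0 + gamma * v_s2 - v_s1    # TD error with bootstrapped target
        w += alpha * delta * 1.0             # semi-gradient update at s1 only
    print(w)  # grows roughly like (1 + alpha * (2 * gamma - 1)) ** 50

Because the bootstrapped target 2w rises faster than the fitted value w, every update enlarges the very quantity it is chasing.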
“…Instead, we can approximate Q^π with a learned function, Q_θ (e.g., a deep network), with parameters θ [21,22,23]. Unfortunately, combining off-policy data, function approximation, and bootstrapping makes learning unstable and potentially divergent [1,2,14]. The problem arises when the parameters of the Q-network are updated to better approximate the Q-value of a state-action pair at the cost of worsening the approximation of other Q-values, including the ones used as targets.…”
Section: Q-value Estimation (mentioning)
confidence: 99%
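To make that last point concrete, here is a small sketch under assumed linear features of my own choosing (not the setup of any cited paper): a semi-gradient step aimed at one state-action pair also moves the value of another pair, including the one that supplies the bootstrapped target, because the parameters are shared.

    import numpy as np

    theta = np.zeros(2)
    phi_s1 = np.array([1.0, 0.5])   # features of (s1, a); second component is shared
    phi_s2 = np.array([0.0, 1.0])   # features of (s2, a'), used to build the target
    gamma, alpha, reward = 0.9, 0.5, 1.0
    for _ in range(5):
        target = reward + gamma * phi_s2 @ theta    # bootstrapped from Q(s2, a')
        td_error = target - phi_s1 @ theta
        theta += alpha * td_error * phi_s1          # semi-gradient step for (s1, a)
        # The same step has also moved Q(s2, a') = phi_s2 @ theta, i.e. the target.
        print(phi_s1 @ theta, phi_s2 @ theta)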
“…Deep Q-learning (DQL) can be unstable when trained with both function approximation and bootstrapping. These are two components of the so-called "deadly triad", with off-policy learning being the third [1,2]. The instability is largely due to the constantly changing target Q-values: the deep neural network (DNN) parameters are regularly updated, the bootstrapped targets built from them are therefore constantly chased, and the result is a regression problem of a non-stationary nature.…”
Section: Introduction (mentioning)
confidence: 99%
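A common way to tame the non-stationarity described above is a separate target network that is synchronized with the online parameters only periodically; this is a standard deep Q-learning technique, not something specific to the excerpt. A minimal sketch under assumed linear Q-features, with random arrays standing in for environment transitions (the structure, not the dynamics, is the point):

    import numpy as np

    rng = np.random.default_rng(0)
    online = rng.normal(size=(4, 2))          # online Q weights: 4 features, 2 actions
    target = online.copy()                    # frozen copy used only for targets
    gamma, alpha, sync_every = 0.99, 0.05, 100
    for step in range(1000):
        phi, a = rng.normal(size=4), int(rng.integers(2))   # fake transition data
        r, phi_next = 1.0, rng.normal(size=4)
        td_target = r + gamma * np.max(phi_next @ target)   # built from the frozen copy
        td_error = td_target - (phi @ online)[a]
        online[:, a] += alpha * td_error * phi               # semi-gradient update
        if step % sync_every == 0:
            target = online.copy()                           # periodic hard sync

Between syncs the regression target is fixed, so each batch of updates solves a stationary problem.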
“…Zhang et al. (2020b) provided a new variant of ETD, where the emphatic weights are estimated through function approximation. Van Hasselt et al. (2018) and Jiang et al. (2021) studied ETD with a deep neural function class. Comparison to concurrent work: while we were preparing this paper, a concurrent work (Zhang and Whiteson, 2021) was posted on arXiv; it proposed a truncated ETD (T-ETD for short) that truncates the update of the follow-on trace to reduce the variance of ETD.…”
Section: Related Work (mentioning)
confidence: 99%
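For readers unfamiliar with the follow-on trace mentioned in that excerpt, here is a hedged sketch of a single ETD(0)-style update in an assumed linear setting (my own notation and function name, not code from the cited works). The truncation referred to above simply caps the trace before it scales the update:

    import numpy as np

    def etd0_update(w, phi, r, phi_next, rho, rho_prev, f_prev,
                    gamma=0.99, alpha=0.05, interest=1.0, clip=None):
        # Follow-on trace: discounted, importance-weighted accumulation of interest.
        f = interest + gamma * rho_prev * f_prev
        if clip is not None:
            f = min(f, clip)                        # truncated ETD: bound the trace
        delta = r + gamma * phi_next @ w - phi @ w  # TD(0) error
        w = w + alpha * f * rho * delta * phi       # emphasis-weighted update
        return w, f

Capping f bounds the otherwise unbounded variance of the emphasis at the cost of some bias, which is the trade-off the truncated variant is designed around.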