2018
DOI: 10.48550/arxiv.1812.02648
Preprint

Deep Reinforcement Learning and the Deadly Triad

Cited by 39 publications (60 citation statements). References: 0 publications.
“…Intuitively, if Q_2(·) (the target) is changing faster (dictated by Θ_2) than the actual value Q_1(·) (dictated by Θ_1), learning will not converge. This result supports the divergence claim of standard deep Q-learning [14].…”
Section: Figure (supporting)
confidence: 88%
“…Deep Q-learning in its pure form often shows divergent behavior with function approximation [12,14]. It has no known convergence guarantees, although convergence results have been obtained for some related algorithms [15].…”
Section: Introduction (mentioning)
confidence: 99%
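The divergence mentioned in the two excerpts above can be reproduced in a few lines. The following is a minimal illustrative sketch of my own (in the spirit of the classic two-state counterexample of Tsitsiklis and Van Roy, not code from any cited paper): a single weight w represents v(s1) = w and v(s2) = 2w, and repeatedly applying a semi-gradient TD(0) update on the s1 -> s2 transition alone makes w grow without bound whenever gamma > 0.5.

    # Illustrative only: two states whose values w and 2w share one weight.
    gamma, alpha = 0.99, 0.1
    w = 1.0
    for step in range(50):
        v_s1, v_s2 = 1.0 * w, 2.0 * w        # linear values, features 1 and 2
        delta = 0.0 + gamma * v_s2 - v_s1    # TD error with bootstrapped target
        w += alpha * delta * 1.0             # semi-gradient update at s1 only
    print(w)  # grows roughly like (1 + alpha * (2 * gamma - 1)) ** 50

Because the bootstrapped target 2w rises faster than the fitted value w, every update enlarges the very quantity it is chasing.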
“…Instead, we can approximate Q^π with a learned function, Q_θ (e.g., a deep network), with parameters θ [21,22,23]. Unfortunately, combining off-policy data, function approximation, and bootstrapping makes learning unstable and potentially divergent [1,2,14]. The problem arises when the parameters of the Q-network are updated to better approximate the Q-value of a state-action pair at the cost of worsening the approximation of other Q-values, including the ones used as targets.…”
Section: Q-value Estimation (mentioning)
confidence: 99%
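To make that last point concrete, here is a small sketch under assumed linear features of my own choosing (not the setup of any cited paper): a semi-gradient step aimed at one state-action pair also moves the value of another pair, including the one that supplies the bootstrapped target, because the parameters are shared.

    import numpy as np

    theta = np.zeros(2)
    phi_s1 = np.array([1.0, 0.5])   # features of (s1, a); second component is shared
    phi_s2 = np.array([0.0, 1.0])   # features of (s2, a'), used to build the target
    gamma, alpha, reward = 0.9, 0.5, 1.0
    for _ in range(5):
        target = reward + gamma * phi_s2 @ theta    # bootstrapped from Q(s2, a')
        td_error = target - phi_s1 @ theta
        theta += alpha * td_error * phi_s1          # semi-gradient step for (s1, a)
        # The same step has also moved Q(s2, a') = phi_s2 @ theta, i.e. the target.
        print(phi_s1 @ theta, phi_s2 @ theta)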
“…Deep Q-learning (DQL) can be unstable when trained with both function approximation and bootstrapping. These are two components of the so-called "deadly triad", with off-policy learning being the third [1,2]. The instability is largely due to the constantly changing target Q-values: the deep neural network (DNN) parameters are regularly updated, the bootstrapped targets built from them are therefore constantly chased, and the result is a regression problem of a non-stationary nature.…”
Section: Introduction (mentioning)
confidence: 99%
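A common way to tame the non-stationarity described above is a separate target network that is synchronized with the online parameters only periodically; this is a standard deep Q-learning technique, not something specific to the excerpt. A minimal sketch under assumed linear Q-features, with random arrays standing in for environment transitions (the structure, not the dynamics, is the point):

    import numpy as np

    rng = np.random.default_rng(0)
    online = rng.normal(size=(4, 2))          # online Q weights: 4 features, 2 actions
    target = online.copy()                    # frozen copy used only for targets
    gamma, alpha, sync_every = 0.99, 0.05, 100
    for step in range(1000):
        phi, a = rng.normal(size=4), int(rng.integers(2))   # fake transition data
        r, phi_next = 1.0, rng.normal(size=4)
        td_target = r + gamma * np.max(phi_next @ target)   # built from the frozen copy
        td_error = td_target - (phi @ online)[a]
        online[:, a] += alpha * td_error * phi               # semi-gradient update
        if step % sync_every == 0:
            target = online.copy()                           # periodic hard sync

Between syncs the regression target is fixed, so each batch of updates solves a stationary problem.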
“…Zhang et al. (2020b) provided a new variant of ETD, where the emphatic weights are estimated through function approximation. Van Hasselt et al. (2018) and Jiang et al. (2021) studied ETD with a deep neural function class. Comparison to concurrent work: while we were preparing this paper, a concurrent work (Zhang and Whiteson, 2021) was posted on arXiv; it proposed a truncated ETD (T-ETD for short) that truncates the update of the follow-on trace to reduce the variance of ETD.…”
Section: Related Work (mentioning)
confidence: 99%
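For readers unfamiliar with the follow-on trace mentioned in that excerpt, here is a hedged sketch of a single ETD(0)-style update in an assumed linear setting (my own notation and function name, not code from the cited works). The truncation referred to above simply caps the trace before it scales the update:

    import numpy as np

    def etd0_update(w, phi, r, phi_next, rho, rho_prev, f_prev,
                    gamma=0.99, alpha=0.05, interest=1.0, clip=None):
        # Follow-on trace: discounted, importance-weighted accumulation of interest.
        f = interest + gamma * rho_prev * f_prev
        if clip is not None:
            f = min(f, clip)                        # truncated ETD: bound the trace
        delta = r + gamma * phi_next @ w - phi @ w  # TD(0) error
        w = w + alpha * f * rho * delta * phi       # emphasis-weighted update
        return w, f

Capping f bounds the otherwise unbounded variance of the emphasis at the cost of some bias, which is the trade-off the truncated variant is designed around.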