2021
DOI: 10.48550/arxiv.2101.08862
Preprint

Breaking the Deadly Triad with a Target Network


Cited by 4 publications (9 citation statements) · References 27 publications
Citation statements: 0 supporting, 9 mentioning, 0 contrasting
“…Learning rates for Greedy GQ (GGQ) and Coupled Q Learning (CQL) are set as 0.05 and 0.25, respectively, as in Carvalho et al., 2020 and Maei et al., 2010. Since CQL requires normalized feature values, we scaled the feature values by 1/2 as in Carvalho et al., 2020, and initialized the weights as one. We implemented Q-learning with a target network (Zhang et al., 2021) without projection, for practical reasons (Qtarget). We set the learning rates as 0.25 and 0.05, respectively, and the weight η as two.…”
Section: Methods (mentioning)
confidence: 99%
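The quoted setup describes linear Q-learning with a target network (no projection), weights initialized to one, and learning rates of 0.05/0.25. A minimal sketch of that kind of baseline is given below; it is an illustrative reconstruction, not the code of the citing paper or of Zhang et al., 2021. The environment interface, the feature map `phi`, the ε-greedy behaviour policy, and the target refresh period are assumptions, and the quoted weight η and the 1/2 feature scaling used for CQL are not modelled here.

```python
# Illustrative sketch only: semi-gradient Q-learning with linear function
# approximation and a target network, without a projection step, roughly
# matching the quoted setup. `env` (with reset()/step() returning
# (next_state, reward, done)) and the feature map `phi(s, a)` are
# hypothetical placeholders, not part of the cited papers.
import numpy as np

def q_learning_with_target_network(env, phi, num_actions, num_steps=10_000,
                                   alpha=0.05,         # learning rate (quoted value)
                                   target_period=100,  # assumed refresh interval
                                   gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = phi(env.reset(), 0).shape[0]
    w = np.ones(d)            # online weights, initialized to one as in the quote
    w_target = w.copy()       # target-network weights

    s = env.reset()
    for t in range(num_steps):
        # epsilon-greedy action selection from the online weights
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax([phi(s, b) @ w for b in range(num_actions)]))
        s_next, r, done = env.step(a)

        # bootstrap target computed from the *target* weights
        q_next = 0.0 if done else max(phi(s_next, b) @ w_target
                                      for b in range(num_actions))
        td_error = r + gamma * q_next - phi(s, a) @ w
        w += alpha * td_error * phi(s, a)

        # periodic hard copy of the online weights into the target network
        if (t + 1) % target_period == 0:
            w_target = w.copy()

        s = env.reset() if done else s_next
    return w
```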
“…Carvalho et al., 2020 and Zhang et al., 2021 assume ‖x(s, a)‖∞ ≤ 1 for all (s, a) ∈ S × A. Moreover, Zhang et al., 2021 requires specific bounds on the feature matrix which depend on various factors, e.g., the projection radius and the transition matrix.…”
Section: Q-learning With Linear Function Approximation (mentioning)
confidence: 99%
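The boundedness assumption ‖x(s, a)‖∞ ≤ 1 discussed in this statement can be enforced by rescaling the feature matrix by its largest absolute entry. The sketch below is only an illustration of the assumption, not code from any of the cited papers; the raw feature matrix `X_raw` is a hypothetical input.

```python
# Minimal sketch: rescale raw features so that ||x(s, a)||_inf <= 1 holds for
# every state-action pair. `X_raw` (one row per (s, a) pair) is hypothetical.
import numpy as np

def scale_features_to_unit_sup_norm(X_raw):
    """Divide by the largest absolute entry so every feature vector has
    sup-norm at most 1."""
    max_abs = np.max(np.abs(X_raw))
    return X_raw if max_abs == 0 else X_raw / max_abs

X_raw = np.array([[0.5, 3.0],
                  [-2.0, 1.0]])
X = scale_features_to_unit_sup_norm(X_raw)
assert np.max(np.abs(X)) <= 1.0   # the assumption now holds by construction
```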
“…Despite their resounding empirical success in deep RL, a theoretical understanding of the use of target networks in actor-critic methods is largely missing in the literature. Theoretical contributions investigating the use of a target network are very recent and limited to temporal difference (TD) learning for policy evaluation [23] and critic-only methods such as Q-learning for control [48]. In particular, these works are not concerned with actor-critic algorithms and leave the question of the finite-time analysis open.…”
Section: Introduction (mentioning)
confidence: 99%