2018
DOI: 10.48550/arxiv.1806.01175
Preprint

TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning

Abstract: Our understanding of reinforcement learning (RL) has been shaped by theoretical and empirical results that were obtained decades ago using tabular representations and linear function approximators. These results suggest that RL methods that use temporal differencing (TD) are superior to direct Monte Carlo estimation (MC). How do these results hold up in deep RL, which deals with perceptually complex environments and deep nonlinear models? In this paper, we re-examine the role of TD in modern deep RL, using spe…
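To make the abstract's TD-versus-MC comparison concrete, here is a minimal sketch (not from the paper) contrasting the Monte Carlo return target with the one-step TD target for a single state. The rewards, discount factor, and the current value estimate are illustrative assumptions.

```python
# Hypothetical illustration (not from the paper): given one recorded episode,
# compare the two value targets for a state s_t.
#
#   Monte Carlo target:  G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
#   TD(0) target:        r_t + gamma * V(s_{t+1})   (bootstraps a current estimate)

gamma = 0.99
rewards = [0.0, 0.0, 1.0]      # rewards r_t, r_{t+1}, r_{t+2} observed in one episode
v_next_estimate = 0.85         # current value estimate V(s_{t+1})

mc_target = sum(gamma**k * r for k, r in enumerate(rewards))
td_target = rewards[0] + gamma * v_next_estimate

print(f"MC target: {mc_target:.3f}, TD target: {td_target:.3f}")
```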

Cited by 3 publications (11 citation statements)
References 6 publications
“…Reinforcement learning (RL) and deep reinforcement learning (DRL) are incredibly autonomous and interoperable, i.e., they have many real-time Internet of Things (IoT) applications. RL pertains to a machine learning method based on trial-and-error, which improves the performance by accepting feedback from the environment [1,2]. There have been many studies on applying RL or DRL in IoT, which are relevant to a variety of applications, such as energy demand based on the critical load or real-time electricity prices in a smart grid.…”
Section: Introduction (mentioning)
confidence: 99%
“…Robots or smart vehicles using IoT are autonomous in their working environment, wherein they attempt to find a collision-free path from the current location to the target. Regarding the applications of RL or DRL in autonomous IoT, a broad spectrum of technology exists, such as fast real-time decisions made locally in a vehicle or the transmission of data to and from the cloud [1,2]. In particular, one of the large issues regarding a real-time fast decision is online learning and decision making based on approximate results from the learning [1][2][3] similar to near-optimal path-planning with respect to real-time criteria.…”
Section: Introduction (mentioning)
confidence: 99%
“…As a result, the statistical accuracy of the value function learned by nonlinear TD remains unclear. In contrast to such conservative theory, neural TD, which straightforwardly combines TD with neural networks without the explicit local linearization in nonlinear TD, often learns a desired value function that generalizes well to unseen states in practice (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Hence, a gap separates theory from practice.…”
Section: Introduction (mentioning)
confidence: 99%
“…Such a reformulation leads to bilevel optimization, which is less stable in practice when combined with neural networks (Pfau and Vinyals, 2016). As a result, both extensions of TD are less widely used in deep reinforcement learning (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Moreover, when using neural networks for value function approximation, the convergence to the global optimum of MSPBE remains unclear for both extensions of TD.…”
Section: Introduction (mentioning)
confidence: 99%
“…In Reinforcement Learning (RL), Temporal-Difference (TD) learning has become a design choice which is shared among the most successful algorithms that are present in the field [2]. Whether it is used in a Tabular-RL setting [11,33], or in combination with a function approximator [28,30], TD methods aim to learn a Value function, V, by directly bootstrapping their own experiences at different time-steps t. This is done with respect to a discount factor, γ, and a reward, r, which allow the computation of the TD-errors, r_t + γV(s_{t+1}) − V(s_t).…”
Section: Introduction (mentioning)
confidence: 99%
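The excerpt above describes the tabular TD(0) update behind that TD-error. Below is a minimal sketch of it, assuming a Gymnasium-style env.reset()/env.step() interface and a user-supplied policy function; the names and hyperparameters are illustrative assumptions, not taken from the cited papers.

```python
from collections import defaultdict

def td0_value_estimation(env, policy, episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) sketch of the update described in the excerpt:
    V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t)).

    Assumptions: `env` follows a Gymnasium-style reset()/step() interface and
    `policy` maps a state to an action; both are placeholders for illustration.
    """
    V = defaultdict(float)  # value table; unseen states default to 0
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD target r_t + gamma * V(s_{t+1}); no bootstrap past a terminal state
            target = reward + (0.0 if terminated else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # move V(s_t) toward the target
            state = next_state
    return V
```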