2018
DOI: 10.48550/arxiv.1806.01175
Preprint

TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning

Abstract: Our understanding of reinforcement learning (RL) has been shaped by theoretical and empirical results that were obtained decades ago using tabular representations and linear function approximators. These results suggest that RL methods that use temporal differencing (TD) are superior to direct Monte Carlo estimation (MC). How do these results hold up in deep RL, which deals with perceptually complex environments and deep nonlinear models? In this paper, we re-examine the role of TD in modern deep RL, using spe…
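To make the abstract's TD-versus-MC comparison concrete, here is a minimal sketch (not from the paper) contrasting the Monte Carlo return target with the one-step TD target for a single state. The rewards, discount factor, and the current value estimate are illustrative assumptions.

```python
# Hypothetical illustration (not from the paper): given one recorded episode,
# compare the two value targets for a state s_t.
#
#   Monte Carlo target:  G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
#   TD(0) target:        r_t + gamma * V(s_{t+1})   (bootstraps a current estimate)

gamma = 0.99
rewards = [0.0, 0.0, 1.0]      # rewards r_t, r_{t+1}, r_{t+2} observed in one episode
v_next_estimate = 0.85         # current value estimate V(s_{t+1})

mc_target = sum(gamma**k * r for k, r in enumerate(rewards))
td_target = rewards[0] + gamma * v_next_estimate

print(f"MC target: {mc_target:.3f}, TD target: {td_target:.3f}")
```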

Cited by 3 publications (11 citation statements)
References 6 publications
“…Reinforcement learning (RL) and deep reinforcement learning (DRL) are incredibly autonomous and interoperable, i.e., they have many real-time Internet of Things (IoT) applications. RL pertains to a machine learning method based on trial-and-error, which improves the performance by accepting feedback from the environment [1,2]. There have been many studies on applying RL or DRL in IoT, which are relevant to a variety of applications, such as energy demand based on the critical load or real-time electricity prices in a smart grid.…”
Section: Introduction (mentioning)
confidence: 99%
“…Robots or smart vehicles using IoT are autonomous in their working environment, wherein they attempt to find a collision-free path from the current location to the target. Regarding the applications of RL or DRL in autonomous IoT, a broad spectrum of technology exists, such as fast real-time decisions made locally in a vehicle or the transmission of data to and from the cloud [1,2]. In particular, one of the large issues regarding a real-time fast decision is online learning and decision making based on approximate results from the learning [1][2][3] similar to near-optimal path-planning with respect to real-time criteria.…”
Section: Introduction (mentioning)
confidence: 99%
“…As a result, the statistical accuracy of the value function learned by nonlinear TD remains unclear. In contrast to such conservative theory, neural TD, which straightforwardly combines TD with neural networks without the explicit local linearization in nonlinear TD, often learns a desired value function that generalizes well to unseen states in practice (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Hence, a gap separates theory from practice.…”
Section: Introduction (mentioning)
confidence: 99%
“…Such a reformulation leads to bilevel optimization, which is less stable in practice when combined with neural networks (Pfau and Vinyals, 2016). As a result, both extensions of TD are less widely used in deep reinforcement learning (Duan et al., 2016; Amiranashvili et al., 2018; Henderson et al., 2018). Moreover, when using neural networks for value function approximation, the convergence to the global optimum of MSPBE remains unclear for both extensions of TD.…”
Section: Introduction (mentioning)
confidence: 99%
“…In Reinforcement Learning (RL), Temporal-Difference (TD) learning has become a design choice which is shared among the most successful algorithms that are present in the field [2]. Whether it is used in a Tabular-RL setting [11,33], or in combination with a function approximator [28,30], TD methods aim to learn a Value function, V, by directly bootstrapping their own experiences at different time-steps t. This is done with respect to a discount factor, γ, and a reward, r, which allow the computation of the TD-errors, r_t + γV(s_{t+1}) − V(s_t).…”
Section: Introduction (mentioning)
confidence: 99%
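The excerpt above describes the tabular TD(0) update behind that TD-error. Below is a minimal sketch of it, assuming a Gymnasium-style env.reset()/env.step() interface and a user-supplied policy function; the names and hyperparameters are illustrative assumptions, not taken from the cited papers.

```python
from collections import defaultdict

def td0_value_estimation(env, policy, episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) sketch of the update described in the excerpt:
    V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t)).

    Assumptions: `env` follows a Gymnasium-style reset()/step() interface and
    `policy` maps a state to an action; both are placeholders for illustration.
    """
    V = defaultdict(float)  # value table; unseen states default to 0
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # TD target r_t + gamma * V(s_{t+1}); no bootstrap past a terminal state
            target = reward + (0.0 if terminated else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # move V(s_t) toward the target
            state = next_state
    return V
```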