Encyclopedia of Machine Learning 2011
DOI: 10.1007/978-0-387-30164-8_817

Temporal Difference Learning

Cited by 8 publications (7 citation statements)
References 8 publications
“…It chooses the best action in each cycle except for a fixed fraction of the time when it tries a random action. We have implemented the Q(λ) algorithm (Watkins, 1989), which subsumes the simpler Q(0) algorithm as a special case, and also HLQ(λ), which is similar except that it automatically adapts its learning rate (Hutter and Legg, 2007). Finally, we have created a wrapper for MC-AIXI (Veness et al., 2010, 2011), a more advanced reinforcement learning agent that can be viewed as an approximation to Hutter's AIXI.…”
Section: Results
confidence: 99%
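The excerpt describes ε-greedy exploration combined with Watkins's Q(λ). Below is a minimal tabular sketch of that combination; the environment interface (env.reset(), env.step(), env.actions), the helper epsilon_greedy, and all parameter values are illustrative assumptions, not taken from the cited work.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_lambda(env, episodes=500, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """Tabular Watkins's Q(lambda) with epsilon-greedy exploration (sketch).

    Assumes env.reset() -> state, env.step(action) -> (next_state, reward, done),
    and a list of discrete actions in env.actions (hypothetical interface).
    """
    Q = defaultdict(float)              # Q[(state, action)] -> action-value estimate
    for _ in range(episodes):
        e = defaultdict(float)          # eligibility traces
        s = env.reset()
        a = epsilon_greedy(Q, s, env.actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a_greedy = max(env.actions, key=lambda b: Q[(s2, b)])
            a2 = epsilon_greedy(Q, s2, env.actions, epsilon)
            target = r if done else r + gamma * Q[(s2, a_greedy)]
            delta = target - Q[(s, a)]  # TD error toward the greedy backup
            e[(s, a)] += 1.0            # accumulating trace
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            if a2 != a_greedy:          # Watkins's cut: exploratory action resets traces
                e.clear()
            s, a = s2, a2
    return Q
```

Setting lam=0 makes every trace vanish right after its own update, recovering one-step Q-learning, which matches the excerpt's remark that Q(λ) subsumes Q(0) as a special case.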
“…By taking an action and moving from one state to another, based on the Bellman equation and Bellman update scheme [51], the value function is gradually updated using sample transitions. This procedure is referred to as the Temporal Difference (TD) update [51]. There are two approaches to updating the policy: “on-policy learning” and “off-policy learning”.…”
Section: Problem Formulation
confidence: 99%
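The quoted passage summarizes how the value function is improved from sampled transitions via a Bellman-style bootstrap, i.e. the TD update. A minimal tabular TD(0) sketch is below; the transition format and the state names in the usage lines are illustrative assumptions.

```python
from collections import defaultdict

def td0_update(V, transition, alpha=0.1, gamma=0.99):
    """Apply one tabular TD(0) update from a single sampled transition.

    V maps state -> value estimate; `transition` is assumed to be a
    (state, reward, next_state, done) tuple (illustrative format).
    """
    s, r, s2, done = transition
    target = r if done else r + gamma * V[s2]   # Bellman-style bootstrap target
    td_error = target - V[s]                    # the temporal difference
    V[s] += alpha * td_error
    return td_error

# Usage with hypothetical states "A" and "B":
V = defaultdict(float)
td0_update(V, ("A", 1.0, "B", False))
```

Whether the bootstrap follows the behaviour policy's own next action (as in SARSA) or the greedy action (as in Q-learning) is what separates the on-policy and off-policy variants mentioned in the excerpt.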
“…None of the existing methods for step-size adaptation in TD learning satisfy both of our criteria while also performing well in practice. HL(λ) (Hutter et al., 2007) and AlphaBound (Dabney and Barto, 2012) have a single step-size, which only decreases in value. RMSprop (Tieleman and Hinton, 2012) satisfies our criteria and can be trivially generalized to TD; however, it does not perform well in TD learning, as we demonstrate in this paper.…”
Section: Step-Size Adaptation in Temporal-Difference Learning
confidence: 99%
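The excerpt contrasts single, monotonically decreasing step-sizes with per-weight adaptive schemes such as RMSprop carried over to TD. The sketch below shows one way an RMSprop-style running average of squared updates could scale a linear semi-gradient TD(0) step; it is an illustration under assumed array shapes and hyperparameters, not the adaptation method evaluated in the cited paper.

```python
import numpy as np

def rmsprop_td0_step(w, v, phi_s, phi_s2, reward, done,
                     lr=1e-3, gamma=0.99, decay=0.9, eps=1e-8):
    """One linear TD(0) update with an RMSprop-style per-weight step size (sketch).

    w: value weights; v: running average of squared updates;
    phi_s, phi_s2: feature vectors of the current and next state
    (all assumed to be 1-D NumPy arrays of equal length).
    """
    target = reward if done else reward + gamma * np.dot(w, phi_s2)
    delta = target - np.dot(w, phi_s)          # TD error
    grad = delta * phi_s                       # semi-gradient update direction
    v = decay * v + (1.0 - decay) * grad ** 2  # RMSprop running average
    w = w + lr * grad / (np.sqrt(v) + eps)     # per-weight effective step size
    return w, v, delta
```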