2019
DOI: 10.48550/arxiv.1910.00125
Preprint

Meta-Q-Learning

Abstract: This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, using a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay…
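As a reading aid for the first two ideas in the abstract, here is a minimal sketch (not the authors' code) of a Q-function conditioned on a context vector summarizing the past trajectory, trained with a TD loss averaged over training tasks. The class name ContextQNetwork, the helper multi_task_td_loss, the policy interface, and all network sizes are illustrative assumptions.

```python
# Minimal sketch, not the MQL reference implementation: a Q-function that
# takes a context vector summarizing the past trajectory, trained with a
# one-step TD loss averaged across training tasks (the multi-task objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextQNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, ctx_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, ctx):
        # ctx: fixed-size representation of the past trajectory
        return self.net(torch.cat([obs, act, ctx], dim=-1)).squeeze(-1)

def multi_task_td_loss(q_net, target_q_net, policy, task_batches, gamma=0.99):
    """Average the one-step TD error over a list of per-task batches.

    `policy(next_obs, ctx)` is an assumed interface returning next actions.
    """
    losses = []
    for b in task_batches:  # each b: dict of tensors for one training task
        with torch.no_grad():
            next_act = policy(b["next_obs"], b["ctx"])
            target = b["reward"] + gamma * (1.0 - b["done"]) * target_q_net(
                b["next_obs"], next_act, b["ctx"])
        q = q_net(b["obs"], b["act"], b["ctx"])
        losses.append(F.mse_loss(q, target))
    # Averaging across tasks mirrors the multi-task (average-across-tasks) objective.
    return torch.stack(losses).mean()
```

Averaging per-task losses is the simplest way to reflect the "maximize the average reward across the training tasks" objective; the paper's actual training procedure differs in details.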

Cited by 20 publications (32 citation statements) · References 17 publications

Citation statements (ordered by relevance):
“…That is, RL learning does not require the latest updated policy to interact with the environment. Rather, it can leverage experience from other policies, for example by learning from replay-buffer samples collected under an old policy, as in [18]. Examples of this category include [19], [20], where off-policy meta-RL algorithms were developed by decoupling task inference from policy training.…”
Section: Related Work
confidence: 99%
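The statement above relies on the standard off-policy property that experience collected by older policies can be reused. A minimal, generic replay-buffer sketch (not taken from [18]-[20]) makes this concrete:

```python
# Generic replay buffer (illustrative): transitions gathered under any past
# policy are stored and later resampled for off-policy updates.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, obs, act, reward, next_obs, done):
        self.storage.append((obs, act, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling; a batch may mix data from arbitrarily old policies.
        batch = random.sample(list(self.storage), batch_size)
        obs, act, rew, next_obs, done = zip(*batch)
        return obs, act, rew, next_obs, done
```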
“…In the meta-testing phase, $D_{\mathrm{test}} = \{D^{(k)}\}_{k=K+1}^{N}$ are sampled from the same task distribution. Although meta-optimization approaches [14,23] have been successfully applied to various image classification tasks, their performance is relatively limited in RL tasks [13]. Recent advances in context-based meta-RL [27] learn a latent representation of the task and construct a context model through recurrent networks [18,8].…”
Section: Preliminaries
confidence: 99%
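The recurrent context model mentioned above (and reused in the next statement) can be pictured as a GRU that consumes past transitions and emits a latent task representation. The sketch below is an illustration under assumed names and dimensions, not the model from [18,8] or [27].

```python
# Illustrative recurrent context encoder: a GRU reads past
# (observation, action, reward) tuples; its final hidden state serves as a
# latent task/context representation.
import torch
import torch.nn as nn

class RecurrentContextEncoder(nn.Module):
    def __init__(self, obs_dim, act_dim, ctx_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, ctx_dim, batch_first=True)

    def forward(self, obs_seq, act_seq, rew_seq):
        # obs_seq: (batch, T, obs_dim), act_seq: (batch, T, act_dim),
        # rew_seq: (batch, T, 1)
        x = torch.cat([obs_seq, act_seq, rew_seq], dim=-1)
        _, h = self.gru(x)      # h: (1, batch, ctx_dim), final hidden state
        return h.squeeze(0)     # context variable fed to the policy/Q-function
```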
“…Thereby, learning efficiency is limited [2]. To address the limited state-visitation problem in the ET-MDP, we adopt the idea of context models, previously introduced in the meta-RL literature [13,27], to improve the generality of policies across different training tasks. In our ET-MDP setting, a context variable is learned to improve the generality of the learned policy over different states, thus enabling the policy to perform safely across different states within one task.…”
Section: Introduction
confidence: 99%
“…There are many concrete formulations of meta-RL (see, e.g., Wang et al., 2015; Duan et al., 2016; Houthooft et al., 2018; Rakelly et al., 2019; Zintgraf et al., 2019; Fakoor et al., 2019; Ortega et al., 2019). Our focus is meta-RL through gradient-based adaptation (Finn et al., 2017), where the agent carries out policy-gradient (PG) inner-loop updates (Sutton et al., 2000) at both meta-training and meta-testing time.…”
Section: Introduction
confidence: 99%
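For the gradient-based formulation referenced above, a first-order sketch of one policy-gradient inner-loop adaptation step looks as follows. The policy.log_prob interface and the use of pre-computed returns are assumptions for illustration; MAML-style methods additionally differentiate through this step in the outer loop.

```python
# First-order sketch of a policy-gradient inner-loop adaptation step
# (illustrative; actual MAML-style meta-RL backpropagates through this update).
import copy
import torch

def inner_loop_adapt(policy, task_batch, inner_lr=0.1):
    """Return a task-adapted copy of `policy` after one REINFORCE-style step.

    Assumes `policy.log_prob(obs, act)` returns per-sample log-probabilities
    and `task_batch["returns"]` holds pre-computed Monte Carlo returns.
    """
    adapted = copy.deepcopy(policy)
    params = list(adapted.parameters())
    log_probs = adapted.log_prob(task_batch["obs"], task_batch["act"])
    loss = -(log_probs * task_batch["returns"]).mean()  # PG surrogate loss
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= inner_lr * g  # one SGD step on this task's data
    return adapted
```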