Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs
2021 · Preprint · DOI: 10.48550/arxiv.2110.05038

Abstract: Many problems in RL, such as meta-RL, robust RL, and generalization in RL, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisi…

Cited by 8 publications (9 citation statements) · References 44 publications

“…While TD3-LSTM shows good performance on high-dimensional sensor integration tasks, its network architecture is designed for POMDPs that are solvable with very short-term memory (e.g., 3-5 timesteps) and cannot be efficiently extended. A concurrent work [29] also combines TD3 and LSTM, but can be used with arbitrary memory length and is hence most similar to our work.…”
Section: Related Work (mentioning; confidence: 93%)
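
To make the combination this excerpt describes concrete, below is a minimal PyTorch sketch of an LSTM-based actor in the TD3 style that conditions on observation histories of arbitrary length. It is an illustration of the general recipe, not the authors' implementation; the class name, layer sizes, and dimensions are assumptions chosen for brevity.

```python
# Hedged sketch: an LSTM-based actor for a TD3-style agent. Nothing in the
# architecture ties it to a fixed memory window, so the effective memory
# length is bounded only by the length of the sequences it is fed.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        # The LSTM summarizes the observation history into its hidden state.
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim); `hidden` carries memory
        # across calls, so online acting can span a whole episode.
        z = torch.relu(self.encoder(obs_seq))
        z, hidden = self.lstm(z, hidden)
        return self.head(z), hidden

# Online rollout: feed one observation at a time and carry the hidden state.
actor = RecurrentActor(obs_dim=4, act_dim=2)
hidden = None
for _ in range(10):  # stand-in for an environment loop
    obs = torch.randn(1, 1, 4)
    action, hidden = actor(obs, hidden)
```

Because the history is absorbed into the recurrent state rather than a fixed-size input, this design extends to long-memory POMDPs without architectural changes, which is the contrast the citing authors draw against short-window TD3-LSTM variants.
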
“…POMDPs can also be related to generalisation in RL and the robust RL [22] setting, which considers worst-case performance. While various specialised algorithms exist for these different problem settings, recent results from Ni et al [23] have shown that simple recurrent model-free RL agents can perform well across the board. Motivated by this (and further related arguments by Schmidhuber [31]), we proceed by treating every environment as a POMDP, in which an agent attempts to learn using a general algorithmic framework.…”
Section: Generalised Upside Down RL in POMDPs (mentioning; confidence: 99%)
“…We propose uniting these directions to create general learning agents, with the position that RL can itself be framed as an SL problem. This is not a novel proposition [25,31,32,20,12,3,17,8,11], but in contrast to prior works, we provide a general framework that includes online RL, goal-conditioned RL (GCRL) [28], imitation learning (IL) [26], offline RL [9], and meta-RL [29], as well as other paradigms contained within partially observed Markov decision processes (POMDPs) [23]. We build upon the proposal of upside down RL (UDRL) by Schmidhuber [31], the implementation of Srivastava et al [32], and sequence modelling via Decision Transformers [3,17].…”
Section: Introduction (mentioning; confidence: 99%)
“…However, it is not immediately obvious which information will be of relevance later on, and all past observations are equally weighted when simply expanding the input vector. Finally, a further approach is to incorporate recurrency into the function approximators of model-free algorithms, which was shown to be capable of strong performances (Ni et al, 2021). The recurrency enriches the agent's decision making by extracting information of past observations, potentially yielding an improved ability to solve problems without access to the complete state vector.…”
Section: Long Short-Term Memory (LSTM) Based TD3 (mentioning; confidence: 99%)
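
The excerpt above contrasts two ways of handling partial observability in model-free function approximators: expanding the input vector with a fixed window of past observations (which weights them all equally) versus incorporating recurrence. The sketch below illustrates that contrast with two critic heads; it is a reading of the quoted passage under assumed shapes and names, not code from any of the cited papers.

```python
# Hedged sketch: a fixed-window critic that concatenates the last k
# observations versus a recurrent critic whose LSTM can, in principle,
# retain and reweight information from the entire history.
import torch
import torch.nn as nn

class WindowCritic(nn.Module):
    """Q(history, a) over a fixed window of the last k observations."""
    def __init__(self, obs_dim, act_dim, k, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_window, action):
        # obs_window: (batch, k, obs_dim), flattened; every past observation
        # enters on an equal footing, as the quoted passage notes.
        x = torch.cat([obs_window.flatten(1), action], dim=-1)
        return self.net(x)

class RecurrentCritic(nn.Module):
    """Q over a history of arbitrary length, summarized by an LSTM."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_seq, action):
        # obs_seq: (batch, seq_len, obs_dim); the final hidden state is a
        # learned summary that can weight past observations unequally.
        _, (h, _) = self.lstm(obs_seq)
        return self.head(torch.cat([h[-1], action], dim=-1))

obs_seq, act = torch.randn(8, 20, 4), torch.randn(8, 2)
q1 = WindowCritic(4, 2, k=20)(obs_seq, act)   # tied to a window of 20
q2 = RecurrentCritic(4, 2)(obs_seq, act)      # any sequence length works
```

The recurrent variant is what the citing authors mean by "incorporating recurrency into the function approximators": the LSTM learns what to keep from the history instead of weighting every past observation equally.
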