Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs
2021 · Preprint · DOI: 10.48550/arxiv.2110.05038

Abstract: Many problems in RL, such as meta-RL, robust RL, and generalization in RL, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisi…

Cited by 8 publications (9 citation statements) · References 44 publications

“…While TD3-LSTM shows good performance on high-dimensional sensor integration tasks, its network architecture is designed for POMDPs that are solvable with very short-term memory (e.g., 3-5 timesteps) and cannot be efficiently extended. A concurrent work [29] also combines TD3 and LSTM, but can be used with arbitrary memory length and is hence most similar to our work.…”
Section: Related Work (mentioning; confidence: 93%)
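
To make the combination this excerpt describes concrete, below is a minimal PyTorch sketch of an LSTM-based actor in the TD3 style that conditions on observation histories of arbitrary length. It is an illustration of the general recipe, not the authors' implementation; the class name, layer sizes, and dimensions are assumptions chosen for brevity.

```python
# Hedged sketch: an LSTM-based actor for a TD3-style agent. Nothing in the
# architecture ties it to a fixed memory window, so the effective memory
# length is bounded only by the length of the sequences it is fed.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        # The LSTM summarizes the observation history into its hidden state.
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim); `hidden` carries memory
        # across calls, so online acting can span a whole episode.
        z = torch.relu(self.encoder(obs_seq))
        z, hidden = self.lstm(z, hidden)
        return self.head(z), hidden

# Online rollout: feed one observation at a time and carry the hidden state.
actor = RecurrentActor(obs_dim=4, act_dim=2)
hidden = None
for _ in range(10):  # stand-in for an environment loop
    obs = torch.randn(1, 1, 4)
    action, hidden = actor(obs, hidden)
```

Because the history is absorbed into the recurrent state rather than a fixed-size input, this design extends to long-memory POMDPs without architectural changes, which is the contrast the citing authors draw against short-window TD3-LSTM variants.
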
“…POMDPs can also be related to generalisation in RL and the robust RL [22] setting, which considers worst-case performance. While various specialised algorithms exist for these different problem settings, recent results from Ni et al [23] have shown that simple recurrent model-free RL agents can perform well across the board. Motivated by this (and further related arguments by Schmidhuber [31]), we proceed by treating every environment as a POMDP, in which an agent attempts to learn using a general algorithmic framework.…”
Section: Generalised Upside Down RL in POMDPs (mentioning; confidence: 99%)
“…We propose uniting these directions to create general learning agents, with the position that RL can itself be framed as an SL problem. This is not a novel proposition [25,31,32,20,12,3,17,8,11], but in contrast to prior works, we provide a general framework that includes online RL, goal-conditioned RL (GCRL) [28], imitation learning (IL) [26], offline RL [9], and meta-RL [29], as well as other paradigms contained within partially observed Markov decision processes (POMDPs) [23]. We build upon the proposal of upside down RL (UDRL) by Schmidhuber [31], the implementation of Srivastava et al [32], and sequence modelling via Decision Transformers [3,17].…”
Section: Introduction (mentioning; confidence: 99%)
“…However, it is not immediately obvious which information will be of relevance later on, and all past observations are equally weighted when simply expanding the input vector. Finally, a further approach is to incorporate recurrency into the function approximators of model-free algorithms, which was shown to be capable of strong performances (Ni et al, 2021). The recurrency enriches the agent's decision making by extracting information of past observations, potentially yielding an improved ability to solve problems without access to the complete state vector.…”
Section: Long Short-Term Memory (LSTM) Based TD3 (mentioning; confidence: 99%)
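
The excerpt above contrasts two ways of handling partial observability in model-free function approximators: expanding the input vector with a fixed window of past observations (which weights them all equally) versus incorporating recurrence. The sketch below illustrates that contrast with two critic heads; it is a reading of the quoted passage under assumed shapes and names, not code from any of the cited papers.

```python
# Hedged sketch: a fixed-window critic that concatenates the last k
# observations versus a recurrent critic whose LSTM can, in principle,
# retain and reweight information from the entire history.
import torch
import torch.nn as nn

class WindowCritic(nn.Module):
    """Q(history, a) over a fixed window of the last k observations."""
    def __init__(self, obs_dim, act_dim, k, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_window, action):
        # obs_window: (batch, k, obs_dim), flattened; every past observation
        # enters on an equal footing, as the quoted passage notes.
        x = torch.cat([obs_window.flatten(1), action], dim=-1)
        return self.net(x)

class RecurrentCritic(nn.Module):
    """Q over a history of arbitrary length, summarized by an LSTM."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_seq, action):
        # obs_seq: (batch, seq_len, obs_dim); the final hidden state is a
        # learned summary that can weight past observations unequally.
        _, (h, _) = self.lstm(obs_seq)
        return self.head(torch.cat([h[-1], action], dim=-1))

obs_seq, act = torch.randn(8, 20, 4), torch.randn(8, 2)
q1 = WindowCritic(4, 2, k=20)(obs_seq, act)   # tied to a window of 20
q2 = RecurrentCritic(4, 2)(obs_seq, act)      # any sequence length works
```

The recurrent variant is what the citing authors mean by "incorporating recurrency into the function approximators": the LSTM learns what to keep from the history instead of weighting every past observation equally.
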