2021
DOI: 10.48550/arxiv.2102.12344
Preprint

Memory-based Deep Reinforcement Learning for POMDPs

Abstract: A promising characteristic of Deep Reinforcement Learning (DRL) is its capability to learn an optimal policy in an end-to-end manner without relying on feature engineering. However, most approaches assume a fully observable state space, i.e. a fully observable Markov Decision Process (MDP). In real-world robotics this assumption is impractical, because of sensor issues such as limited sensor capacity and sensor noise, and because it is often unknown whether the observation design is complete. These scen…
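
For reference, and not taken from the abstract itself, a POMDP extends the MDP tuple with an observation space and an observation model, so the agent acts on observations rather than on the true state:

    \langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle,
    \qquad T(s' \mid s, a), \qquad O(o \mid s', a), \qquad o \in \Omega .

Because several states can produce the same observation, the agent generally needs a memory of past observations (or a belief over states) to act well, which is the motivation for the memory-based approach studied here.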

Cited by 3 publications (9 citation statements)
References 23 publications

“…We use the term "standard" to refer to prior work that explicitly labels the problems studied as POMDPs. Common tasks include scenarios where the states are partially occluded [36], different states correspond to the same observation (perceptual aliasing [105]), random frames are dropped [35], observations use egocentric images [116], or the observations are perturbed with random noise [61]. These POMDPs often have hidden states that are non-stationary and affect both the rewards and the dynamics.…”
Section: Related Work
confidence: 99%
“…The recurrence enriches the agent's decision making by extracting information from past observations, potentially yielding an improved ability to solve problems without access to the complete state vector. Concretely, Meng et al. (2021) proposed an extension of TD3 called LSTM-TD3, which adds LSTM layers (Hochreiter and Schmidhuber, 1997) to the actor and critic of TD3. The resulting algorithm showed impressive results on several benchmark tasks from the continuous action domain.…”
Section: Long-Short-Term-Memory (LSTM) Based TD3
confidence: 99%
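
As a rough illustration of the architecture described in the quote above, the following is a minimal sketch of an LSTM-augmented TD3 actor in PyTorch. It is not the reference LSTM-TD3 implementation by Meng et al. (2021); the class name, layer sizes, and the use of a fixed-length observation window are assumptions made here for clarity.

    # Minimal sketch (not the authors' code) of an LSTM-based actor for TD3.
    # Names such as LSTMActor, obs_dim, act_dim and act_limit are placeholders.
    import torch.nn as nn

    class LSTMActor(nn.Module):
        def __init__(self, obs_dim, act_dim, act_limit, hidden_size=128):
            super().__init__()
            # The LSTM summarizes the observation history into a memory state.
            self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
            # A feed-forward head maps the memory state to a bounded action.
            self.head = nn.Sequential(
                nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                nn.Linear(hidden_size, act_dim), nn.Tanh(),
            )
            self.act_limit = act_limit

        def forward(self, obs_seq, hidden=None):
            # obs_seq has shape (batch, seq_len, obs_dim): a window of past
            # observations; hidden carries the LSTM state across calls.
            out, hidden = self.lstm(obs_seq, hidden)
            action = self.act_limit * self.head(out[:, -1, :])
            return action, hidden

The critic would be extended analogously, with the action concatenated to the memory state before the Q-value head, while the TD3 training loop (twin critics, delayed policy updates, target policy smoothing) stays unchanged.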
“…o_0 is a zero-valued dummy observation of the same dimension as a regular observation. Note that the definition of h_t^l differs slightly from Meng et al. (2021), since we do not include past actions in the history. Furthermore, we set l = 2 throughout the paper because, from a physical perspective, the velocity and acceleration of an obstacle can be estimated from its current and two last positions.…”
Section: Long-Short-Term-Memory (LSTM) Based TD3
confidence: 99%
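
The history construction described in the quote can be sketched as follows. This is an illustrative reading of the quoted definition (the current observation plus the l most recent past observations, zero-padded at the start of an episode), not code from the cited paper; the function name and the obs_dim default are assumptions.

    # Illustrative sketch of the history h_t^l with l = 2: the current
    # observation plus the two most recent past ones, padded with the
    # zero-valued dummy observation o_0 near the start of an episode.
    import numpy as np

    def build_history(observations, t, l=2, obs_dim=4):
        """observations: list of per-step observation vectors; t: current step."""
        o_dummy = np.zeros(obs_dim)  # o_0 in the quoted notation
        window = [observations[k] if k >= 0 else o_dummy
                  for k in range(t - l, t + 1)]
        return np.concatenate(window)

With l = 2 and positional observations, the window contains the current and two previous positions, the minimum needed to form finite-difference estimates of an obstacle's velocity and acceleration, which matches the physical argument given in the quote.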