2021
DOI: 10.48550/arxiv.2102.12344
Preprint

Memory-based Deep Reinforcement Learning for POMDPs

Abstract: A promising characteristic of Deep Reinforcement Learning (DRL) is its capability to learn an optimal policy in an end-to-end manner without relying on feature engineering. However, most approaches assume a fully observable state space, i.e. a fully observable Markov Decision Process (MDP). In real-world robotics this assumption is impractical, because of sensor issues such as limited sensor capacity and sensor noise, and because it is often unknown whether the observation design is complete. These scen…
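
For reference, and not taken from the abstract itself, a POMDP extends the MDP tuple with an observation space and an observation model, so the agent acts on observations rather than on the true state:

    \langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle,
    \qquad T(s' \mid s, a), \qquad O(o \mid s', a), \qquad o \in \Omega .

Because several states can produce the same observation, the agent generally needs a memory of past observations (or a belief over states) to act well, which is the motivation for the memory-based approach studied here.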

Cited by 3 publications (9 citation statements)
References 23 publications

“…We use the term "standard" to refer to prior work that explicitly labels the problems studied as POMDPs. Common tasks include scenarios where the states are partially occluded [36], different states correspond to the same observation (perceptual aliasing [105]), random frames are dropped [35], observations use egocentric images [116], or the observations are perturbed with random noise [61]. These POMDPs often have hidden states that are non-stationary and affect both the rewards and the dynamics.…”
Section: Related Work
confidence: 99%
“…The recurrence enriches the agent's decision making by extracting information from past observations, potentially yielding an improved ability to solve problems without access to the complete state vector. Concretely, Meng et al. (2021) proposed an extension of TD3 called LSTM-TD3, which adds LSTM layers (Hochreiter and Schmidhuber, 1997) to the actor and critic of TD3. The resulting algorithm showed impressive results on several benchmark tasks from the continuous action domain.…”
Section: Long-Short-Term-Memory (LSTM) Based TD3
confidence: 99%
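
As a rough illustration of the architecture described in the quote above, the following is a minimal sketch of an LSTM-augmented TD3 actor in PyTorch. It is not the reference LSTM-TD3 implementation by Meng et al. (2021); the class name, layer sizes, and the use of a fixed-length observation window are assumptions made here for clarity.

    # Minimal sketch (not the authors' code) of an LSTM-based actor for TD3.
    # Names such as LSTMActor, obs_dim, act_dim and act_limit are placeholders.
    import torch.nn as nn

    class LSTMActor(nn.Module):
        def __init__(self, obs_dim, act_dim, act_limit, hidden_size=128):
            super().__init__()
            # The LSTM summarizes the observation history into a memory state.
            self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
            # A feed-forward head maps the memory state to a bounded action.
            self.head = nn.Sequential(
                nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                nn.Linear(hidden_size, act_dim), nn.Tanh(),
            )
            self.act_limit = act_limit

        def forward(self, obs_seq, hidden=None):
            # obs_seq has shape (batch, seq_len, obs_dim): a window of past
            # observations; hidden carries the LSTM state across calls.
            out, hidden = self.lstm(obs_seq, hidden)
            action = self.act_limit * self.head(out[:, -1, :])
            return action, hidden

The critic would be extended analogously, with the action concatenated to the memory state before the Q-value head, while the TD3 training loop (twin critics, delayed policy updates, target policy smoothing) stays unchanged.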
“…o_0 is a zero-valued dummy observation of the same dimension as a regular observation. Note that the definition of h_t^l differs slightly from Meng et al. (2021), since we do not include past actions in the history. Furthermore, we set l = 2 throughout the paper because, from a physical perspective, the velocity and acceleration of an obstacle can be estimated from its current and two last positions.…”
Section: Long-Short-Term-Memory (LSTM) Based TD3
confidence: 99%
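
The history construction described in the quote can be sketched as follows. This is an illustrative reading of the quoted definition (the current observation plus the l most recent past observations, zero-padded at the start of an episode), not code from the cited paper; the function name and the obs_dim default are assumptions.

    # Illustrative sketch of the history h_t^l with l = 2: the current
    # observation plus the two most recent past ones, padded with the
    # zero-valued dummy observation o_0 near the start of an episode.
    import numpy as np

    def build_history(observations, t, l=2, obs_dim=4):
        """observations: list of per-step observation vectors; t: current step."""
        o_dummy = np.zeros(obs_dim)  # o_0 in the quoted notation
        window = [observations[k] if k >= 0 else o_dummy
                  for k in range(t - l, t + 1)]
        return np.concatenate(window)

With l = 2 and positional observations, the window contains the current and two previous positions, the minimum needed to form finite-difference estimates of an obstacle's velocity and acceleration, which matches the physical argument given in the quote.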