2015
DOI: 10.48550/arxiv.1506.02438
Preprint

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Abstract: Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of polic…
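The estimator the abstract alludes to is GAE(γ, λ): an exponentially weighted sum of temporal-difference residuals δ_t = r_t + γV(s_{t+1}) − V(s_t). Below is a minimal NumPy sketch of that computation; the function name and array layout are illustrative choices, not taken from the paper's reference implementation.

```python
import numpy as np

def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for one trajectory.

    rewards: array of shape [T]      -- r_0 ... r_{T-1}
    values:  array of shape [T + 1]  -- V(s_0) ... V(s_T), bootstrap value last
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    gae = 0.0
    # A_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}, accumulated backwards
    for t in reversed(range(T)):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting lam=1 recovers the Monte-Carlo advantage (lower bias, higher variance), while lam=0 reduces to a single TD residual (lower variance, more bias from the value function).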

Cited by 534 publications (697 citation statements)
References 12 publications
“…In the analysis, we introduce a slightly modified reward function instead of using the commonly applied function described in [24], because the existing reward function [24] is likely to result in a conservative policy in which the robot does not walk but remains in place. The existing reward function [24] can be defined by…”
Section: Reward Function (mentioning)
confidence: 99%
“…There are multiple choices of the advantage function [23], and we use the baseline version of the Monte-Carlo returns to reduce the variance: Evaluate MBPO and construct the reward as the average return: R_i = Avg(η);…”
Section: Hyper-Controller Learning (mentioning)
confidence: 99%
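The "baseline version of the Monte-Carlo returns" quoted above amounts to subtracting a learned state-value baseline from the empirical discounted return. A hedged sketch under that reading follows; the function and argument names are illustrative, not from the cited work.

```python
import numpy as np

def monte_carlo_advantages(rewards, values, gamma=0.99):
    """Advantage = discounted Monte-Carlo return minus a value baseline.

    rewards: array of shape [T]  -- rewards along one rollout
    values:  array of shape [T]  -- baseline predictions V(s_t)
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    # G_t = r_t + gamma * G_{t+1}, accumulated backwards from the episode end
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values
```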
“…This formulation allows us to use advanced, actor-critic-type approaches [25] to improve the sample efficiency. In our implementation, we use generalized advantage estimation (GAE) [26] A(s, a_1, ..., a_n) = Q(s, a_1, ..., a_n) − V(s) in place of the Q function to calculate the gradient and Hessian terms. After sampling a batch of states, actions, and rewards from the replay buffer, we construct two pseudo-objectives, one for the first derivative terms and one for the mixed Hessian terms required for the PCGD update.…”
Section: Multiagent Reinforcement Learning (MARL) (mentioning)
confidence: 99%
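For the actor-critic use described in this statement, the advantage A(s, a_1, ..., a_n) = Q(s, a_1, ..., a_n) − V(s) stands in for the Q function when forming the gradient objective. A minimal PyTorch-style sketch of such a pseudo-objective for the first-derivative terms is given below; the PCGD-specific mixed-Hessian terms are omitted and all names are chosen for illustration.

```python
import torch

def policy_gradient_pseudo_objective(log_probs, advantages):
    """Surrogate whose gradient matches the policy-gradient estimate.

    log_probs:  [T] log pi(a_t | s_t) for the sampled batch (requires grad)
    advantages: [T] advantage estimates, e.g. from GAE; treated as constants
    """
    # Detach advantages so gradients flow only through the policy log-probs.
    return (log_probs * advantages.detach()).mean()

# Usage sketch: maximize the objective by minimizing its negative.
# loss = -policy_gradient_pseudo_objective(log_probs, advantages)
# loss.backward(); optimizer.step()
```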