2020
DOI: 10.1007/978-3-030-61616-8_25
Understanding Failures of Deterministic Actor-Critic with Continuous Action Spaces and Sparse Rewards

Cited by 12 publications (6 citation statements)
References 2 publications
“…Both DDPG and TD3, despite their overall good behavior on average, still exhibit specific flaws in their optimization process. For instance, Matheron et al (2020) illustrated how sparse rewards in deterministic environments could altogether prevent the convergence of these methods by inducing value function plateaus and zero-gradient updates. A state-of-the-art alternative is SAC, which makes the actor policy stochastic (the behavior policy is then the actor policy), forces this policy to imitate the soft-max of the Q-function (instead of the hard-max in DDPG and TD3), and introduces an entropy regularization term in the resolution of the Bellman equation to make the optimization landscape smoother (Geist et al, 2019).…”
Section: Reinforcement Learning For Airfoil Trajectory Control
confidence: 99%
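The failure mode quoted above can be made concrete with a toy gradient computation. The sketch below is illustrative only and assumes PyTorch; the two one-line "critics" are hypothetical stand-ins for a learned Q-network, not code from the cited paper. It shows why a critic that has collapsed to a plateau, as happens when a sparse reward has never been observed, passes a zero gradient to the DDPG/TD3 actor.

```python
# Minimal sketch, assuming PyTorch: a flat critic yields a zero deterministic
# policy gradient, so the DDPG/TD3 actor update stalls.
import torch

action = torch.tensor([[0.3]], requires_grad=True)  # action proposed by the actor

# Hypothetical critics (stand-ins for a learned Q-network):
q_plateau = action * 0.0 + 1.0        # collapsed to a constant: flat in the action
q_informative = -(action - 0.8) ** 2  # still slopes toward a better action

for name, q in [("plateau critic", q_plateau), ("informative critic", q_informative)]:
    (grad,) = torch.autograd.grad(q.sum(), action)
    print(f"{name}: dQ/da = {grad.item():+.3f}")
# plateau critic:     dQ/da = +0.000  -> zero-gradient actor update
# informative critic: dQ/da = +1.000  -> gradient points toward the rewarded region
```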
“…In a (D)RL framework, the representation of a reward function is critical, as it quantifies the value associated with each state and action pair and assists the agent in learning an optimal policy. [26] remarks that (D)RL with sparse rewards can lead to instability and suboptimal policy convergence. Likewise, each reward component should be weighted optimally in a multi-objective DRL agent to achieve the desired outcome and faster convergence.…”
Section: Related Work
confidence: 99%
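To make the weighting point in the excerpt above concrete, here is a minimal sketch; the reward terms, their names, and the weight values are hypothetical and chosen only for illustration, not taken from the cited work. The purely sparse term contributes no signal until the goal is first reached, while the dense, weighted terms shape learning throughout.

```python
# Minimal sketch of a weighted multi-objective reward with one sparse component.
def reward(tracking_error: float, control_effort: float, goal_reached: bool,
           w_track: float = 1.0, w_effort: float = 0.1, w_goal: float = 10.0) -> float:
    dense = -w_track * tracking_error - w_effort * control_effort  # shaped, always informative
    sparse = w_goal if goal_reached else 0.0                       # zero almost everywhere
    return dense + sparse

print(reward(tracking_error=0.5, control_effort=0.2, goal_reached=False))  # -0.52
print(reward(tracking_error=0.0, control_effort=0.1, goal_reached=True))   # 9.99
```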
“…They show that the episode reward of the EIDM-guided DDPG converges to a steady value much faster than that of DDPG, which indicates that the EIDM-guided DDPG can effectively improve the sample efficiency, training stability, and convergence of DDPG. DDPG's poor performance and unstable convergence could be due to the policy getting stuck in a local optimum with poor solutions during training (Matheron et al., 2019), which entails more human effort to revise hyperparameters and DNN structures to improve the training performance. Figure 8c shows the episode rewards of the EIDM-guided DDPG for the remaining three cases (i.e., two, three, and four preceding vehicles).…”
Section: Model Training
confidence: 99%