2021
DOI: 10.1109/access.2021.3094566
A Functional Clipping Approach for Policy Optimization Algorithms

Abstract: Proximal policy optimization (PPO) has yielded state-of-the-art results in policy search, a subfield of reinforcement learning, with one of its key points being the use of a surrogate objective function to restrict the step size at each policy update. Although such a restriction is helpful, the algorithm still suffers from performance instability and optimization inefficiency caused by the sudden flattening of the curve. To address this issue, we present a novel functional clipping policy optimization algorithm, named…
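
The abstract refers to PPO's clipped surrogate objective, whose hard clipping is what produces the "sudden flattening" of the objective curve. For reference, below is a minimal sketch of that standard clipped loss in PyTorch (an assumed framework); it is not the paper's functional clipping variant, whose exact form is truncated above, and the function and variable names are illustrative.

```python
# Minimal sketch of the standard PPO clipped surrogate loss (PyTorch assumed).
# Names (ppo_clip_loss, log_prob_new, ...) are illustrative, not from the paper.
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Pessimistic clipped surrogate: the hard clamp flattens the objective
    (zero gradient) once the probability ratio leaves [1 - eps, 1 + eps]."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Element-wise minimum of the two terms, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

A functional clipping approach, as the title suggests, would replace the hard clamp above with a smoother clipping function so that the gradient decays gradually instead of vanishing abruptly at the clip boundary; the specific function used in the paper is not shown in the truncated abstract.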

Cited by 5 publications (7 citation statements). References 11 publications (14 reference statements).
“…Since their performances generally decreased with stronger regularization methods, these results are probably due to too strong regularization of π to b for the replayed data. As pointed out in (Wang et al., 2020; Zhu and Rosendo, 2021), PPO has no capability to softly constrain π to b even with a too small threshold, and therefore, it yielded the near-optimal policy in the tasks except Swingup. In contrast to them, only PPO-RPE-A stably learned all the tasks.…”
Section: Results for Simple Tasks
confidence: 96%
“…{e,r}PPO-RB (Wang et al., 2020): η = 0.3 as the recommended value;
3. {e,r}PPOS (Zhu and Rosendo, 2021): η = 0.3 as the recommended value;
4. {e,r}PPO-RPE (Kobayashi, 2021a):…”
Section: Results for Simple Tasks
confidence: 99%