2020
DOI: 10.1609/aaai.v34i04.6059

Discretizing Continuous Action Space for On-Policy Optimization

Abstract: In this work, we show that discretizing action space for continuous control is a simple yet powerful technique for on-policy optimization. The explosion in the number of discrete actions can be efficiently addressed by a policy with factorized distribution across action dimensions. We show that the discrete policy achieves significant performance gains with state-of-the-art on-policy optimization algorithms (PPO, TRPO, ACKTR) especially on high-dimensional tasks with complex dynamics. Additionally, we show tha…
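The factorized policy mentioned in the abstract is easy to sketch. Below is a minimal, illustrative PyTorch version (not the authors' code): each of the D action dimensions gets its own categorical distribution over K bins of a fixed action grid, so the network outputs D·K logits instead of K^D joint probabilities. The bin count, layer sizes, and action range are assumptions for the example.

```python
# Sketch of a factorized discrete policy for continuous control (illustrative,
# not the paper's implementation). Each action dimension has its own
# categorical over `bins` values; the joint policy factorizes across dims.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class FactorizedDiscretePolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, bins=11, low=-1.0, high=1.0):
        super().__init__()
        self.act_dim, self.bins = act_dim, bins
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, act_dim * bins))
        # Fixed grid mapping each bin index back to a continuous action value.
        self.register_buffer("grid", torch.linspace(low, high, bins))

    def forward(self, obs):
        logits = self.net(obs).view(-1, self.act_dim, self.bins)
        return Categorical(logits=logits)      # one categorical per action dim

    def act(self, obs):
        dist = self(obs)
        idx = dist.sample()                    # (batch, act_dim) bin indices
        log_prob = dist.log_prob(idx).sum(-1)  # factorized: log-probs add
        return self.grid[idx], log_prob        # continuous action, log pi(a|s)

# Usage with hypothetical dimensions (e.g., a MuJoCo-style task):
policy = FactorizedDiscretePolicy(obs_dim=17, act_dim=6)
action, logp = policy.act(torch.randn(1, 17))
```

The summed log-probability is what PPO/TRPO/ACKTR consume, so the discrete policy drops into these algorithms without changing the surrounding on-policy machinery.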

Cited by 46 publications (24 citation statements)
References 18 publications
“…For example, the control actions of an on-load tap changer with 3 tap positions that correspond to turns ratios of 0.95, 1, and 1.05 can be deemed as a discretization of an ordinal variable of turns ratio. Thus, we adopt an ordinal representation [45] for all the discrete actions of a device to encode the natural ordering between the discrete actions.…”
Section: F Device-decoupled Policy Network Structure and Ordinal Encoding
confidence: 99%
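For concreteness, here is a minimal sketch of one common ordinal parameterization of a K-way discrete action; the exact formulation in [45] may differ in its details. Bin k treats the first k+1 sigmoid "threshold" units as on and the rest as off, so probability mass shifts smoothly between neighbouring bins such as the three tap positions above.

```python
# One possible ordinal parameterization (a sketch; details may differ from [45]).
# The logit for bin k sums log-sigmoid terms for thresholds i <= k and
# log(1 - sigmoid) terms for i > k, encoding the natural ordering of bins.
import torch
import torch.nn.functional as F

def ordinal_logits(raw, eps=1e-8):
    """raw: (..., K) unconstrained outputs -> (..., K) ordinal logits."""
    s = torch.sigmoid(raw)                    # per-threshold probabilities
    log_s, log_1ms = torch.log(s + eps), torch.log(1 - s + eps)
    K = raw.shape[-1]
    mask = torch.tril(torch.ones(K, K))       # mask[k, i] = 1 iff i <= k
    return log_s @ mask.T + log_1ms @ (1 - mask).T

# Example: 3 ordered tap positions (turns ratios 0.95, 1.0, 1.05).
probs = F.softmax(ordinal_logits(torch.zeros(1, 3)), dim=-1)
```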
“…, $\theta$ is the network parameter of the SPA model, $\pi_\theta(a_t \mid s_t)$ is the probability distribution of the policy under state $s_t$ and action $a_t$ at step $t$. We optimize the policy with minibatch AdamW. The estimated advantage function according to [66–70]…”
Section: Training of Salient Patch Agent
confidence: 99%
“…We optimize the policy with minibatch AdamW. The estimated advantage function according to [66, 67, 68, 69, 70] is $\hat{A}_t = \sum_{t'=t}^{T} \gamma^{\,t'-t} r_{t'} - V(s_t)$, where $\gamma$ is the discount factor, $r_{t'}$ is the SPA reward at step $t'$, $T$ is the number of steps of SPA, and $V(s_t)$ is the value output under state $s_t$.…”
Section: Asnet Framework
confidence: 99%
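Both excerpts describe the same advantage estimate. A minimal sketch under the plain discounted-return reading is below; the cited references [66–70] also cover GAE-style estimators, so the papers' exact estimator may differ.

```python
# Discounted-return advantage over one episode (illustrative; the cited works
# may instead use GAE): A_t = sum_{t'>=t} gamma^(t'-t) * r_t' - V(s_t).
import numpy as np

def advantage(rewards, values, gamma=0.99):
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running   # discounted return-to-go
        returns[t] = running
    return returns - np.asarray(values)          # subtract the value baseline

adv = advantage(rewards=[1.0, 0.0, 2.0], values=[0.5, 0.4, 1.0])
```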
“…The use of molecular fragments simplifies the search problem, while the variable-sized fragment distribution maintains the reachability of most molecular compounds. Because our search algorithm ultimately uses the latent representation of the molecules as the action space, we find that using a VQ-VAE with a categorical prior instead of the typical Gaussian prior makes RL training stable and provides good performance gains (Tang & Agrawal, 2020; Grill et al., 2020).…”
Section: Learning Distributional Fragment Vocabulary
confidence: 99%
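The categorical latent referred to here comes from vector quantization. A brief sketch of that step follows, with an assumed codebook size and latent dimension; it is not the cited implementation, only an illustration of why the latent "action" becomes a discrete codebook index rather than a Gaussian sample.

```python
# Vector-quantization step behind a categorical latent prior (illustrative).
# Each encoder output is snapped to its nearest codebook vector; the discrete
# index plays the role of the latent action used by the RL search.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                              # z: (batch, code_dim)
        dist = torch.cdist(z, self.codebook.weight)    # pairwise L2 distances
        idx = dist.argmin(dim=-1)                      # categorical latent index
        z_q = self.codebook(idx)                       # quantized latent vector
        z_q = z + (z_q - z).detach()                   # straight-through gradient
        return z_q, idx
```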
“…Previous works optimize molecules either by generating from scratch or by a single translation from known molecules, which is inefficient in finding high-quality molecules and often discovers molecules lacking novelty/diversity. Our proposed framework addresses these deficiencies since our method is (1) very efficient in finding molecules that satisfy property constraints, as the model stays close to the high-property-score chemical manifold; and (2) able to produce highly novel molecules, because the sequence of fragment-based translations can lead to very different and diverse molecules compared to the known active set.…”
Section: Introduction
confidence: 99%