2022
DOI: 10.48550/arxiv.2204.02246
Preprint

Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization

Abstract: We present Reward-Switching Policy Optimization (RSPO), a paradigm to discover diverse strategies in complex RL environments by iteratively finding novel policies that are both locally optimal and sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards via a trajectory-based novelty measurement during the optimization process. When a sampled trajectory is sufficien…
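The reward-switching rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the distance-based novelty measure, the threshold, and all function names here are assumptions for exposition.

```python
import numpy as np

def novelty(trajectory, archive):
    # Illustrative novelty measure: distance from this trajectory to the
    # closest trajectory produced by previously discovered policies.
    # With an empty archive everything counts as novel.
    if not archive:
        return float("inf")
    return min(np.linalg.norm(trajectory - ref) for ref in archive)

def switched_reward(trajectory, r_extrinsic, r_intrinsic, archive, threshold):
    # Core reward-switching idea: a sufficiently novel trajectory is
    # optimized with the task (extrinsic) reward, while a trajectory too
    # close to known optima is driven away by the intrinsic novelty reward.
    if novelty(trajectory, archive) >= threshold:
        return r_extrinsic
    return r_intrinsic
```

Iterating this process, adding each converged policy's trajectories to the archive, is what lets the procedure keep converging to previously undiscovered local optima.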

Cited by 2 publications (8 citation statements)
References 21 publications
“…SMERL (Kumar et al. 2020): SMERL maximizes a weighted combination of intrinsic and extrinsic rewards once the extrinsic return exceeds a given threshold. RSPO (Zhou et al. 2022): RSPO is an iterative algorithm for discovering a diverse set of high-quality strategies. It toggles between extrinsic and intrinsic rewards based on a trajectory-based novelty measurement.…”
Section: Methods
confidence: 99%
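The SMERL rule quoted above, adding a diversity bonus only once task performance clears a threshold, can be sketched like this. The weighting scheme and names are assumptions for illustration, not the cited paper's exact formulation.

```python
def smerl_reward(extrinsic_return, r_ext, r_int, alpha, threshold):
    # SMERL-style rule (sketch): the intrinsic diversity bonus is added
    # only when the policy's extrinsic return is already above the
    # performance threshold; otherwise only the task reward is optimized.
    if extrinsic_return >= threshold:
        return r_ext + alpha * r_int
    return r_ext
```

This contrasts with RSPO's switching: SMERL mixes the two rewards conditioned on return, while RSPO selects one or the other per trajectory based on novelty.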
“…However, their method operates in an unsupervised manner, without external rewards. More recently, RSPO (Zhou et al. 2022) was proposed to derive diverse strategies. However, it requires multiple training stages, which results in poor sample efficiency; our method trains diverse strategies simultaneously, which reduces sample complexity.…”
Section: Diversity in Reinforcement Learning
confidence: 99%
“…Recent works have demonstrated that diversity-driven policies can extrapolate to new environments through few-shot adaptation (Eysenbach et al. 2018; Kumar et al. 2020; Osa, Tangkaratt, and Sugiyama 2021; Parker-Holder et al. 2020; Zhou et al. 2022). While a policy population with different behavior characteristics can generalize to different environment variations, the learned policies may raise safety concerns in practical scenarios such as real-world systems, because the behaviors of the diverse policies are unpredictable.…”
Section: Introduction
confidence: 99%