2022
DOI: 10.48550/arxiv.2204.02246
Preprint

Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization

Abstract: We present Reward-Switching Policy Optimization (RSPO), a paradigm to discover diverse strategies in complex RL environments by iteratively finding novel policies that are both locally optimal and sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards via a trajectory-based novelty measurement during the optimization process. When a sampled trajectory is sufficien…
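The reward-switching rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the distance-based novelty measure, the threshold, and all function names here are assumptions for exposition.

```python
import numpy as np

def novelty(trajectory, archive):
    # Illustrative novelty measure: distance from this trajectory to the
    # closest trajectory produced by previously discovered policies.
    # With an empty archive everything counts as novel.
    if not archive:
        return float("inf")
    return min(np.linalg.norm(trajectory - ref) for ref in archive)

def switched_reward(trajectory, r_extrinsic, r_intrinsic, archive, threshold):
    # Core reward-switching idea: a sufficiently novel trajectory is
    # optimized with the task (extrinsic) reward, while a trajectory too
    # close to known optima is driven away by the intrinsic novelty reward.
    if novelty(trajectory, archive) >= threshold:
        return r_extrinsic
    return r_intrinsic
```

Iterating this process, adding each converged policy's trajectories to the archive, is what lets the procedure keep converging to previously undiscovered local optima.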

Cited by 2 publications (8 citation statements)
References 21 publications
“…SMERL (Kumar et al. 2020): SMERL maximizes a weighted combination of intrinsic and extrinsic rewards once the extrinsic return exceeds a given threshold. RSPO (Zhou et al. 2022): RSPO is an iterative algorithm for discovering a diverse set of high-quality strategies. It toggles between extrinsic and intrinsic rewards based on a trajectory-based novelty measurement.…”
Section: Methods
confidence: 99%
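The SMERL rule quoted above, adding a diversity bonus only once task performance clears a threshold, can be sketched like this. The weighting scheme and names are assumptions for illustration, not the cited paper's exact formulation.

```python
def smerl_reward(extrinsic_return, r_ext, r_int, alpha, threshold):
    # SMERL-style rule (sketch): the intrinsic diversity bonus is added
    # only when the policy's extrinsic return is already above the
    # performance threshold; otherwise only the task reward is optimized.
    if extrinsic_return >= threshold:
        return r_ext + alpha * r_int
    return r_ext
```

This contrasts with RSPO's switching: SMERL mixes the two rewards conditioned on return, while RSPO selects one or the other per trajectory based on novelty.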
“…However, their method operates in an unsupervised manner, without external rewards. More recently, RSPO (Zhou et al. 2022) was proposed to derive diverse strategies. However, it requires multiple training stages, which results in poor sample efficiency; our method trains diverse strategies simultaneously, which reduces sample complexity.…”
Section: Diversity in Reinforcement Learning
confidence: 99%
“…Recent works have demonstrated that diversity-driven policies can extrapolate to new environments through few-shot adaptation (Eysenbach et al. 2018; Kumar et al. 2020; Osa, Tangkaratt, and Sugiyama 2021; Parker-Holder et al. 2020; Zhou et al. 2022). While a policy population with different behavior characteristics can generalize to different environment variations, the learned policies may raise safety concerns in practical scenarios such as real-world systems, because the behaviors of the diverse policies are unpredictable.…”
Section: Introduction
confidence: 99%