2020
DOI: 10.48550/arxiv.2005.10696
Preprint

Novel Policy Seeking with Constrained Optimization

Abstract: In this work, we address the problem of seeking novel policies in reinforcement learning tasks. Instead of following the multi-objective framework commonly used in existing methods, we propose to rethink the problem under a novel perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies, and then design two practical novel policy seeking methods following the new perspective, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy…
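
The constrained view described in the abstract amounts to maximizing task return subject to a lower bound on a policy-difference (novelty) metric. Below is a minimal, hypothetical Python sketch of that formulation; it is not the paper's CTNB or IPD implementation, and the names novelty_of, constrained_update, and the parameter-space distance are illustrative assumptions only.

import numpy as np

def novelty_of(policy_params, prior_policies):
    # Stand-in difference metric: mean parameter-space distance to previously
    # found policies. The paper defines its own metric over policy behaviors.
    return float(np.mean([np.linalg.norm(policy_params - p) for p in prior_policies]))

def constrained_update(policy_params, task_grad, novelty_grad,
                       prior_policies, delta=0.5, lr=1e-2):
    # Treat novelty as a constraint rather than a second objective:
    # follow the task gradient while novelty >= delta holds, otherwise
    # step along the novelty gradient to restore feasibility.
    if novelty_of(policy_params, prior_policies) >= delta:
        return policy_params + lr * task_grad
    return policy_params + lr * novelty_grad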

Cited by 6 publications (8 citation statements). References 32 publications (41 reference statements).

“…By contrast, our method utilizes a filtering-based objective via reward switching to strictly enforce all the diversity constraints. Sun et al (2020) adopts a conceptually similar objective by early terminating episodes that do not incur sufficient novelty. However, Sun et al (2020) does not leverage any exploration technique for those rejected samples and may easily suffer from low sample efficiency in challenging RL tasks we consider in this paper.…”
Section: Related Work
confidence: 99%
“…Sun et al (2020) adopts a conceptually similar objective by early terminating episodes that do not incur sufficient novelty. However, Sun et al (2020) does not leverage any exploration technique for those rejected samples and may easily suffer from low sample efficiency in challenging RL tasks we consider in this paper. There is another concurrent work with an orthogonal focus, which directly optimizes diversity with reward constraints (Zahavy et al, 2021).…”
Section: Related Work
confidence: 99%
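
The early-termination idea attributed to Sun et al. (2020) in these statements can be illustrated with a short rollout loop that cuts an episode once the visited states stop being sufficiently novel. This is a hedged sketch assuming a classic gym-style step API returning (obs, reward, done, info); novelty_fn and threshold are hypothetical placeholders, not the cited implementation.

def rollout_with_novelty_gate(env, policy, novelty_fn, threshold, max_steps=1000):
    # Collect a trajectory, rejecting the remainder of the episode as soon as
    # per-step novelty drops below the threshold (the "filtering" behavior
    # discussed above); no extra exploration is applied to rejected samples.
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        if novelty_fn(next_obs) < threshold:
            break  # early termination: insufficient novelty
        obs = next_obs
        if done:
            break
    return trajectory
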
“…Recently, a variety of DRL-based learning methods have been proposed to discover diverse control policies in machine learning, e.g., [Achiam et al. 2018; Conti et al. 2018; Eysenbach et al. 2019; Haarnoja et al. 2018; Hester and Stone 2017; Houthooft et al. 2016; Schmidhuber 1991; Sharma et al. 2019; Sun et al. 2020]. These methods mainly encourage exploration of unseen states or actions by jointly optimizing the task and novelty objectives, or by optimizing intrinsic rewards such as heuristically defined curiosity terms [Eysenbach et al. 2019; Sharma et al. 2019].…”
Section: Diversity Optimization
confidence: 99%
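
The "jointly optimizing the task and novelty objectives" recipe mentioned in this statement typically amounts to a weighted-sum reward. A minimal sketch follows, assuming a simple count-based bonus; the function names and bonus form are illustrative, not any cited method's API.

from collections import defaultdict

state_counts = defaultdict(int)

def count_based_novelty(state_key):
    # Rarely visited states receive a larger intrinsic bonus.
    state_counts[state_key] += 1
    return 1.0 / (state_counts[state_key] ** 0.5)

def shaped_reward(task_reward, state_key, beta=0.1):
    # Weighted-sum objective r = r_task + beta * r_novelty, in contrast to
    # the constrained treatment described in the abstract above.
    return task_reward + beta * count_based_novelty(state_key)
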
“…Our novel policy search is in principle similar to the idea of [Sun et al. 2020]. However, there are two key differences.…”
Section: Stage 2: Novel Policy Seeking
confidence: 99%
“…Sun et al [38] also investigated CMDPs, but focused on the setup where the diversity reward has to satisfy a constraint, so the diversity reward is r_e and the extrinsic reward is r_d. But most importantly, we use a different method to solve CMDPs, which is based on Lagrange multipliers and SFs and is justified from CMDP theory [3,8,7], while these other two papers use techniques that are not guaranteed to solve CMDPs.…”
Section: Solving the Constrained MDP
confidence: 99%
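
The Lagrange-multiplier treatment of the constrained MDP mentioned in this last statement can be sketched as a primal-dual update: ascend on a combined objective and adjust the multiplier whenever the constrained return falls below its threshold. This is a generic CMDP sketch under assumed names; it is not the cited implementation and omits the successor-feature machinery.

def lagrangian_step(return_d, return_e, lam, threshold, lr_lam=1e-2):
    # Maximize the diversity return J_d subject to J_e >= threshold via
    # L = J_d + lam * (J_e - threshold); the policy ascends on L (not shown),
    # while the dual variable lam grows when the constraint is violated.
    lagrangian_value = return_d + lam * (return_e - threshold)
    constraint_violation = threshold - return_e
    new_lam = max(0.0, lam + lr_lam * constraint_violation)
    return lagrangian_value, new_lam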