Learning in strategy games (e.g. StarCraft, poker) requires the discovery of diverse policies. This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploitation. This iterative approach suffers from two issues in real-world games: a) under a finite budget, approximate best-response operators must be truncated at each iteration, resulting in under-trained good-responses populating the population; b) repeatedly learning basic skills at each iteration is wasteful and becomes intractable in the presence of increasingly strong opponents. In this work, we propose Neural Population Learning (NeuPL) as a solution to both issues. NeuPL offers convergence guarantees to a population of best-responses under mild assumptions. By representing a population of policies within a single conditional model, NeuPL enables transfer learning across policies. Empirically, we show the generality, improved performance, and efficiency of NeuPL across several test domains.[1] Most interestingly, we show that novel strategies become more accessible, not less, as the neural population expands.

[1] See https://neupl.github.io/demo/ for supplementary illustrations.
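To make the conditional-model idea concrete, below is a minimal JAX sketch, not the architecture from the paper: a single shared network represents every policy in the population, with each policy identified by the opponent-mixture vector it is conditioned on. All names, shapes, and the two-layer MLP here are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' implementation) of a
# population of policies represented by one conditional network: policy i
# is the shared network conditioned on its opponent-mixture row sigma_i.
import jax
import jax.numpy as jnp

def init_params(key, obs_dim, pop_size, hidden, n_actions):
    k1, k2 = jax.random.split(key)
    in_dim = obs_dim + pop_size  # observation concatenated with mixture row
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) / jnp.sqrt(in_dim),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, n_actions)) / jnp.sqrt(hidden),
        "b2": jnp.zeros(n_actions),
    }

def policy_logits(params, obs, sigma_i):
    # sigma_i gives the probabilities with which policy i faces each existing
    # opponent; conditioning on it selects which best-response the shared
    # network computes, so skills learned by one policy transfer to the rest.
    x = jnp.concatenate([obs, sigma_i])
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

# Example: a population of 4 policies, 8-dim observations, 3 actions.
key = jax.random.PRNGKey(0)
params = init_params(key, obs_dim=8, pop_size=4, hidden=32, n_actions=3)
obs = jnp.ones(8)
sigma_2 = jnp.array([0.5, 0.5, 0.0, 0.0])  # policy 2 responds to policies 0, 1
print(jax.nn.softmax(policy_logits(params, obs, sigma_2)))
```

Because every policy shares one set of parameters, adding a new policy amounts to conditioning on a new mixture vector rather than training a fresh network from scratch, which is the source of the transfer and efficiency claims above.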