2020
DOI: 10.48550/arxiv.2009.14180
Preprint

Learning to Play against Any Mixture of Opponents

Abstract: Intuitively, experience playing against one mixture of opponents in a given domain should be relevant for a different mixture in the same domain. We propose a transfer learning method, Q-Mixing, that starts by learning Q-values against each pure-strategy opponent. Then a Q-value for any distribution of opponent strategies is approximated by appropriately averaging the separately learned Q-values. From these components, we construct policies against all opponent mixtures without any further training. We empiric…
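
The averaging step summarized in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes tabular per-opponent Q-values for the current state and, for simplicity, uses the opponent mixture's prior weights directly as the averaging weights (the paper is concerned with how to do this weighting appropriately); all function and variable names below are illustrative.

```python
import numpy as np

def q_mixing(q_per_opponent, opponent_weights):
    """Average per-opponent Q-values into Q-values for an opponent mixture.

    q_per_opponent: array of shape (num_opponents, num_actions) holding
        Q_i(s, a) for the current state s against each pure-strategy opponent i.
    opponent_weights: array of shape (num_opponents,) giving the probability
        of facing each pure-strategy opponent (here, the mixture's prior).
    Returns the mixed Q-values for state s, shape (num_actions,).
    """
    q = np.asarray(q_per_opponent, dtype=float)
    w = np.asarray(opponent_weights, dtype=float)
    w = w / w.sum()   # normalise, in case the weights are unnormalised
    return w @ q      # opponent-weighted average of the Q-values

def greedy_action(q_per_opponent, opponent_weights):
    """Act greedily with respect to the mixture-averaged Q-values."""
    return int(np.argmax(q_mixing(q_per_opponent, opponent_weights)))

# Example: two pure-strategy opponents, three actions, a 70/30 opponent mixture.
q_table = [[1.0, 0.2, 0.5],   # Q-values learned against opponent 0
           [0.1, 0.9, 0.4]]   # Q-values learned against opponent 1
print(greedy_action(q_table, [0.7, 0.3]))  # -> 0
```

Because the per-opponent Q-values are learned once, the same components yield a policy for any new mixture without further training, which is the transfer property the abstract highlights.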

Cited by 3 publications (3 citation statements)
References 14 publications

“…In this setting, each policy is trained to best-respond to a meta-game pure strategy, rather than a mixture strategy as suggested by the meta-strategy solver. To approximately re-construct a best-response to the desired mixture strategy, Q-mixing (Smith et al., 2020b) re-weights expert policies, instead of retraining a new policy.…”
Section: Related Work
confidence: 99%
“…Performing well against any mixtures over a set of policies has been previously studied in Smith et al. (2020b), where approximate best-responses to mixture policies are constructed by combining Q-values of best-responses to individual mixture components. Despite promising empirical results, the combined policy does not optimise a Bayes-optimal objective explicitly and the quality of the approximation remains to be better understood.…”
Section: Related Work
confidence: 99%
“…At the end of training we are left with a diverse population of Solvers designed for different distributions, which we suspect may be combined to generate a powerful general Solver. There already exist several works on how to mix policies, such as Q-Mixing (Smith et al., 2020), however we instead choose to combine the Solvers based on the meta-strategy. As we use the Nash equilibrium as our meta-solver, we can guarantee that our combined Solver has a given conservative level of performance under the assumption that these Instance distributions can be generated by the Data Generator's policy set.…”
Section: Combining the Solver Population
confidence: 99%