2021 | Preprint
DOI: 10.48550/arxiv.2110.04184
When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?

Abstract: Multi-agent reinforcement learning has made substantial empirical progress in solving games with a large number of players. However, theoretically, the best known sample complexity for finding a Nash equilibrium in general-sum games scales exponentially in the number of players due to the size of the joint action space, and there is a matching exponential lower bound. This paper investigates what learning goals admit better sample complexities in the setting of m-player general-sum Markov games with H steps,…
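To see concretely why the joint action space forces the exponential scaling mentioned above (a back-of-the-envelope illustration under the standard setup, not a statement from the paper): if each of the m players has A_j actions, any procedure that reasons over joint actions faces

    prod_{j=1}^{m} A_j  (= A^m when every A_j = A)

entries per state, whereas the per-player objectives discussed in the citation statements below scale only with A = max_{j in [m]} A_j. For example, m = 10 players with 10 actions each give 10^10 joint actions but only 10 actions per player.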

Cited by 18 publications (32 citation statements) · References 42 publications
“…V-learning, initially coupled with the FTRL algorithm as the adversarial bandit subroutine, was first proposed in the conference version of this paper [3] for finding Nash equilibria in the two-player zero-sum setting. During the preparation of this draft, we note two very recent independent works [47, 32], whose results partially overlap with the results of this paper in the multiplayer general-sum setting. In particular, Mao and Başar [32] use V-learning with stabilized online mirror descent as the adversarial bandit subroutine, and learn ε-CCE in O(H^6 S A / ε^2) episodes, where A = max_{j in [m]} A_j.…”
Section: Related Work (supporting)
confidence: 66%
“…This is one H factor larger than what is required in Theorem 6 of this paper. Song et al. [47] consider similar V-learning-style algorithms for learning both ε-CCE and ε-CE. For the latter objective, they require O(H^6 S A^2 / ε^2) episodes, which is again one H factor larger than what is required in Theorem 7 of this paper.…”
Section: Related Work (mentioning)
confidence: 99%
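To make the algorithmic pattern behind these citation statements concrete, below is a minimal, single-player-view sketch of a V-learning-style update in which each (step, state) pair runs its own adversarial bandit (an exponential-weights / FTRL-type subroutine) over that player's own actions; this per-player structure is what lets the cited sample complexities depend on A = max_{j in [m]} A_j rather than on the joint action space. All names, the learning-rate schedules, and the bonus term are illustrative assumptions, not the exact quantities used in [3], [32], or [47].

import numpy as np

H, S, A = 5, 10, 4            # horizon, number of states, own action count (assumed sizes)
V = np.zeros((H + 1, S))      # value estimates; V[H] stays 0 (terminal step)
V[:H] = H                     # optimistic initialization
theta = np.zeros((H, S, A))   # cumulative importance-weighted losses fed to the bandit
visits = np.zeros((H, S), dtype=int)

def bandit_policy(h, s, eta):
    # Exponential-weights / FTRL distribution over this player's own actions.
    w = np.exp(-eta * (theta[h, s] - theta[h, s].min()))
    return w / w.sum()

def vlearning_update(h, s, a, r, s_next):
    # One per-visit update after playing action a at (h, s) and observing reward r
    # and next state s_next; the other players are folded into the environment.
    visits[h, s] += 1
    t = visits[h, s]
    eta = np.sqrt(np.log(A) / (A * t))       # bandit learning rate (assumed schedule)
    alpha = (H + 1) / (H + t)                # V-learning step size
    beta = np.sqrt(H ** 3 * np.log(A) / t)   # optimism bonus (schematic)

    pi = bandit_policy(h, s, eta)
    loss = (H + 1 - (r + V[h + 1, s_next])) / (H + 1)   # normalize loss to [0, 1]
    theta[h, s, a] += loss / pi[a]                       # importance-weighted feedback, chosen action only

    V[h, s] = min(H, (1 - alpha) * V[h, s] + alpha * (r + V[h + 1, s_next] + beta))

Note that the output of V-learning in the cited works is not the greedy policy from V but a correlated "certified" policy assembled from the bandit distributions recorded during the run; that construction is omitted in this sketch.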
“…Zero-sum Markov games have been widely studied since the seminal work [Shapley, 1953]. When the transition kernel is unknown, different sampling oracles are utilized to acquire samples, including online sampling [Xie et al., 2020a; Liu et al., 2021; Jin et al., 2021a; Song et al., 2021] and sampling from a generative model [Sidford et al., 2020; Cui and Yang, 2020; Zhang et al., 2020; Jia et al., 2019]. For the offline sampling oracle, Zhang et al. [2021b] provide a finite-sample bound for a decentralized algorithm with network communication under a uniform concentration assumption, and Abe and Kaneko [2020] consider offline policy evaluation, again under the uniform concentration assumption.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, the results for equilibrium computation in stochastic games are often much weaker than those for single-agent MDPs, mainly because of the nonstationary environment induced by the players' decisions [14]. In general, there are strong lower bounds for computing stationary NE in stochastic games, which grow exponentially in the number of players [20]. Therefore, prior work has largely focused on the special case of two-player zero-sum stochastic games [14], [21]-[23].…”
mentioning
confidence: 99%