2021
DOI: 10.48550/arxiv.2110.14555
Preprint

V-Learning -- A Simple, Efficient, Decentralized Algorithm for Multiagent RL

Abstract: A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms, V-learning, which provably learns Nash equilibria (in the two-player…
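The abstract only states what V-learning achieves, not how it works. As a rough illustration of the kind of per-agent update such a decentralized scheme implies, here is a minimal Python sketch for an episodic tabular setting. The class name, step size, exploration-bonus constant, and EXP3-style bandit subroutine are illustrative assumptions, not the paper's exact algorithm; the point is that every quantity the agent maintains is indexed only by its own actions, never by the joint action.

```python
import numpy as np

class VLearningAgent:
    """Hypothetical per-agent sketch of a V-learning-style update.

    The agent keeps only V-value estimates over states plus a bandit
    subroutine over its OWN actions, so nothing here scales with the
    joint action space of all agents.
    """

    def __init__(self, num_states, num_actions, horizon, c_bonus=1.0):
        self.H = horizon
        self.A = num_actions
        # Optimistic value tables, one per step h; V[H] is terminal (= 0).
        self.V = np.full((horizon + 1, num_states), float(horizon))
        self.V[horizon] = 0.0
        self.visits = np.zeros((horizon, num_states), dtype=int)
        # Exponential-weights (EXP3-style) bandit state per (h, s).
        self.weights = np.ones((horizon, num_states, num_actions))
        self.c_bonus = c_bonus  # placeholder exploration constant

    def policy(self, h, s):
        w = self.weights[h, s]
        return w / w.sum()

    def act(self, h, s, rng):
        return int(rng.choice(self.A, p=self.policy(h, s)))

    def update(self, h, s, a, reward, next_s):
        # Incremental optimistic V-update with an H/(H+t)-style step size.
        self.visits[h, s] += 1
        t = self.visits[h, s]
        alpha = (self.H + 1) / (self.H + t)
        bonus = self.c_bonus * np.sqrt(self.H ** 3 / t)
        target = reward + self.V[h + 1, next_s] + bonus
        self.V[h, s] = min(self.H,
                           (1 - alpha) * self.V[h, s] + alpha * target)
        # Importance-weighted bandit update on the played action only:
        # the agent never observes the other agents' actions.
        p = self.policy(h, s)[a]
        loss = (self.H - reward - self.V[h + 1, next_s]) / self.H
        eta = np.sqrt(np.log(self.A) / (self.A * t))
        self.weights[h, s, a] *= np.exp(-eta * loss / p)
```

In a usage loop, each agent independently calls `act` to sample its action and `update` with its observed reward and next state; the decentralization emphasized in the abstract shows up as the absence of any joint-action table.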

Cited by 20 publications (40 citation statements); references 36 publications (47 reference statements).
“…Zero-sum Markov games have been widely studied since the seminal work [Shapley, 1953]. When the transition kernel is unknown, different sampling oracles are utilized to acquire samples, including online sampling [Xie et al., 2020a, Liu et al., 2021, Jin et al., 2021a, Song et al., 2021] and generative-model sampling [Sidford et al., 2020, Cui and Yang, 2020, Zhang et al., 2020, Jia et al., 2019]. For the offline sampling oracle, Zhang et al. [2021b] provide a finite-sample bound for a decentralized algorithm with network communication under a uniform concentration assumption, and Abe and Kaneko [2020] consider offline policy evaluation, again under the uniform concentration assumption.…”
Section: Related Work
confidence: 99%
“…(Wei et al., 2017; Xie et al., 2020; Liu et al., 2021; Chen et al., 2021; Jin et al., 2021b; Huang et al., 2021), as well as learning (Coarse) Correlated Equilibria in multi-player general-sum MGs, e.g. (Liu et al., 2021; Song et al., 2021; Jin et al., 2021a; Mao and Başar, 2022). As the settings of MGs in these works do not allow imperfect information, these results do not imply results for learning IIEFGs.…”
Section: Related Work
confidence: 87%
“…MARL. There is a long line of research on the theoretical aspects of MARL, mainly focusing on MGs (Littman, 1994; Xie et al., 2020; Zhang et al., 2020; Liu et al., 2021; Jin et al., 2021). This literature is only partially related for two reasons: it aims to converge to an equilibrium rather than to minimize individual regret, and MGs assume the agents share the current state, whereas in our model different agents traverse different trajectories (i.e., modelling our setting as an MG requires an exponentially large state space).…”
Section: Related Work
confidence: 99%
“…Cooperative multi-agent reinforcement learning (MARL; see Zhang et al. (2021a)) has achieved impressive empirical success in many applications such as cyber-physical systems (Adler & Blue, 2002; Wang et al., 2016), finance (Lee et al., 2002; 2007), and sensor/communication networks (Cortes et al., 2004; Choi et al., 2009). The theoretical work on MARL has focused on either Markov Games (MGs) (Jin et al., 2021), where the goal is to converge to an equilibrium, or stochastic MDPs (Lidard et al., 2021).…”
Section: Introduction
confidence: 99%