2020
DOI: 10.48550/arxiv.2006.12007
Preprint

Near-Optimal Reinforcement Learning with Self-Play

Abstract: This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires Õ(S²AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contra…
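The abstract centers on computing an approximate Nash equilibrium of a two-player zero-sum game through self-play. As a small, self-contained illustration of that object (not the paper's algorithm), the sketch below solves a one-shot zero-sum matrix game by linear programming, the kind of per-state subroutine that value-iteration-style self-play methods run; the payoff matrix, function name, and use of scipy are illustrative choices, not taken from the paper.

# A minimal sketch (illustrative only): Nash equilibrium of a two-player
# zero-sum matrix game via linear programming.
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_game(G: np.ndarray):
    """Return (game value, max-player mixed strategy) for payoff matrix G.

    G[a, b] is the payoff to the max player when the max player plays
    action a and the min player plays action b.
    """
    A, B = G.shape
    # Decision variables: x (A probabilities) and v (the game value).
    # Maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(A), [-1.0]])
    # v <= (G^T x)_b for every min-player action b  =>  -G^T x + v <= 0.
    A_ub = np.hstack([-G.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # x must sum to one.
    A_eq = np.concatenate([np.ones(A), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * A + [(None, None)]  # x >= 0, v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[A], res.x[:A]

# Matching pennies: value 0, equilibrium strategy (1/2, 1/2).
value, strategy = solve_zero_sum_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(value, strategy)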

Cited by 7 publications (11 citation statements). References 16 publications.

Citation statements:
“…Bai and Jin [3] and Xie et al [56] develop the first provably-efficient learning algorithms in MGs based on optimistic value iteration. Bai et al [4] and Liu et al [34] improve upon these works and achieve the best-known sample complexity for model-free and model-based methods, respectively. Several extensions are also studied, including multi-player general-sum MGs [34], unknown games [48], vector-valued MGs [59], etc.…”
Section: Introduction (mentioning)
confidence: 99%
“…Despite the empirical success of MARL, existing theoretical guarantees in MARL only apply to the basic settings where the value functions can be represented by either tables (in cases where the states and actions are discrete) [3,4,34] or linear maps [56,13]. While a recent line of works [24,50,57,26,16] significantly advance our understanding of RL with general function approximation, and provide sample-efficient guarantees for RL with kernels, neural networks, rich observations, and several special cases of partial observability, they are all restricted to the single-agent setting.…”
Section: Introduction (mentioning)
confidence: 99%
“…(Zhang et al, 2020) considers simultaneous stochastic games, and their approach consists of solving a regularized simultaneous stochastic game, which is computationally costly. For the online sampling setting, a recent work (Bai et al, 2020) uses an upper confidence bound algorithm that can find an approximate Nash equilibrium strategy in O(|S||A||B|) steps.…”
Section: Lemma 9, "Set U to be a set of equally spaced points in…" (mentioning)
confidence: 99%
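Each of the quoted passages measures algorithms by how quickly they reach an ε-approximate Nash equilibrium. For reference, that criterion in a two-player zero-sum Markov game is usually stated as a bound on the duality gap; the notation below (value functions V, best responses denoted by †) is the standard one and is not copied from the page above:

\[
  V_1^{\dagger,\hat{\nu}}(s_1) \;-\; V_1^{\hat{\mu},\dagger}(s_1) \;\le\; \epsilon ,
\]

where (μ̂, ν̂) is the learned policy pair, † denotes a best response to the other player's fixed policy, and V₁ is the max-player's expected return from the initial state s₁ under the given pair of policies.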
“…While both model-based and model-free algorithms have been shown to be provably efficient in multi-agent RL in a recent line of work [2,37,3], a more precise understanding of the optimal sample complexities within these two types of algorithms (respectively) is still lacking. In the specific setting of two-player zero-sum Markov games, the current best sample complexity for model-based algorithms is achieved by the VI-ULCB (Value Iteration with Upper/Lower Confidence Bounds) algorithm [2,37]: In a tabular Markov game with S states, {A, B} actions for the two players, and horizon length H, VI-ULCB is able to find an ε-approximate Nash equilibrium policy in Õ(H⁴S²AB/ε²) episodes of game playing.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, compared with the information-theoretic lower bound Ω(H³S(A + B)/ε²), this rate has suboptimal dependencies on all of H, S, and A, B. In contrast, the current best sample complexity for model-free algorithms is achieved by Nash V-Learning [3], which finds an ε-approximate Nash policy in Õ(H⁶S(A + B)/ε²) episodes. Compared with the lower bound, this is tight except for a poly(H) factor, which may seemingly suggest that model-free algorithms could be superior to model-based ones in multi-agent RL.…”
Section: Introduction (mentioning)
confidence: 99%
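Read together, the two excerpts above pin down where the gap lies. A compact restatement of the quoted rates (horizon H, S states, A and B actions for the two players, target accuracy ε); this is only a summary of the numbers cited in the excerpts, not an additional result:

\[
\underbrace{\Omega\!\left(\tfrac{H^{3} S (A+B)}{\epsilon^{2}}\right)}_{\text{information-theoretic lower bound}}
\qquad
\underbrace{\tilde{O}\!\left(\tfrac{H^{6} S (A+B)}{\epsilon^{2}}\right)}_{\text{Nash V-Learning (model-free)}}
\qquad
\underbrace{\tilde{O}\!\left(\tfrac{H^{4} S^{2} A B}{\epsilon^{2}}\right)}_{\text{VI-ULCB (model-based)}}
\]

Dividing by the lower bound, Nash V-Learning is off only by a poly(H) factor (H³), while VI-ULCB is additionally suboptimal in S and in the action dependence, carrying S² and AB in place of S and A + B, which is exactly the comparison the excerpt draws between model-free and model-based rates.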