2020
DOI: 10.48550/arxiv.2006.12007
Preprint

Near-Optimal Reinforcement Learning with Self-Play

Abstract: This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires Õ(S²AB) steps of game playing, when only highlighting the dependency on (S, A, B). In contra…
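The abstract centers on computing an approximate Nash equilibrium of a two-player zero-sum game through self-play. As a small, self-contained illustration of that object (not the paper's algorithm), the sketch below solves a one-shot zero-sum matrix game by linear programming, the kind of per-state subroutine that value-iteration-style self-play methods run; the payoff matrix, function name, and use of scipy are illustrative choices, not taken from the paper.

# A minimal sketch (illustrative only): Nash equilibrium of a two-player
# zero-sum matrix game via linear programming.
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_game(G: np.ndarray):
    """Return (game value, max-player mixed strategy) for payoff matrix G.

    G[a, b] is the payoff to the max player when the max player plays
    action a and the min player plays action b.
    """
    A, B = G.shape
    # Decision variables: x (A probabilities) and v (the game value).
    # Maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(A), [-1.0]])
    # v <= (G^T x)_b for every min-player action b  =>  -G^T x + v <= 0.
    A_ub = np.hstack([-G.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # x must sum to one.
    A_eq = np.concatenate([np.ones(A), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * A + [(None, None)]  # x >= 0, v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[A], res.x[:A]

# Matching pennies: value 0, equilibrium strategy (1/2, 1/2).
value, strategy = solve_zero_sum_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(value, strategy)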

Cited by 7 publications (11 citation statements). References 16 publications.

Citation statements:
“…Bai and Jin [3] and Xie et al [56] develop the first provably-efficient learning algorithms in MGs based on optimistic value iteration. Bai et al [4] and Liu et al [34] improve upon these works and achieve the best-known sample complexity for model-free and model-based methods, respectively. Several extensions are also studied, including multi-player general-sum MGs [34], unknown games [48], vector-valued MGs [59], etc.…”
Section: Introduction (mentioning)
confidence: 99%
“…Despite the empirical success of MARL, existing theoretical guarantees in MARL only apply to the basic settings where the value functions can be represented by either tables (in cases where the states and actions are discrete) [3,4,34] or linear maps [56,13]. While a recent line of works [24,50,57,26,16] significantly advance our understanding of RL with general function approximation, and provide sample-efficient guarantees for RL with kernels, neural networks, rich observations, and several special cases of partial observability, they are all restricted to the single-agent setting.…”
Section: Introduction (mentioning)
confidence: 99%
“…(Zhang et al, 2020) considers simultaneous stochastic games, and their approach consists of solving a regularized simultaneous stochastic game, which is computationally costly. For the online sampling setting, a recent work (Bai et al, 2020) uses an upper confidence bound algorithm that can find an approximate Nash equilibrium strategy in O(|S||A||B|) steps.…”
Section: Lemma 9, "Set U to be a set of equally spaced points in…" (mentioning)
confidence: 99%
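Each of the quoted passages measures algorithms by how quickly they reach an ε-approximate Nash equilibrium. For reference, that criterion in a two-player zero-sum Markov game is usually stated as a bound on the duality gap; the notation below (value functions V, best responses denoted by †) is the standard one and is not copied from the page above:

\[
  V_1^{\dagger,\hat{\nu}}(s_1) \;-\; V_1^{\hat{\mu},\dagger}(s_1) \;\le\; \epsilon ,
\]

where (μ̂, ν̂) is the learned policy pair, † denotes a best response to the other player's fixed policy, and V₁ is the max-player's expected return from the initial state s₁ under the given pair of policies.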
“…While both model-based and model-free algorithms have been shown to be provably efficient in multi-agent RL in a recent line of work [2,37,3], a more precise understanding of the optimal sample complexities within these two types of algorithms (respectively) is still lacking. In the specific setting of two-player zero-sum Markov games, the current best sample complexity for model-based algorithms is achieved by the VI-ULCB (Value Iteration with Upper/Lower Confidence Bounds) algorithm [2,37]: In a tabular Markov game with S states, {A, B} actions for the two players, and horizon length H, VI-ULCB is able to find an ε-approximate Nash equilibrium policy in Õ(H⁴S²AB/ε²) episodes of game playing.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, compared with the information-theoretic lower bound Ω(H³S(A + B)/ε²), this rate has suboptimal dependencies on all of H, S, and A, B. In contrast, the current best sample complexity for model-free algorithms is achieved by Nash V-Learning [3], which finds an ε-approximate Nash policy in Õ(H⁶S(A + B)/ε²) episodes. Compared with the lower bound, this is tight except for a poly(H) factor, which may seemingly suggest that model-free algorithms could be superior to model-based ones in multi-agent RL.…”
Section: Introduction (mentioning)
confidence: 99%
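Read together, the two excerpts above pin down where the gap lies. A compact restatement of the quoted rates (horizon H, S states, A and B actions for the two players, target accuracy ε); this is only a summary of the numbers cited in the excerpts, not an additional result:

\[
\underbrace{\Omega\!\left(\tfrac{H^{3} S (A+B)}{\epsilon^{2}}\right)}_{\text{information-theoretic lower bound}}
\qquad
\underbrace{\tilde{O}\!\left(\tfrac{H^{6} S (A+B)}{\epsilon^{2}}\right)}_{\text{Nash V-Learning (model-free)}}
\qquad
\underbrace{\tilde{O}\!\left(\tfrac{H^{4} S^{2} A B}{\epsilon^{2}}\right)}_{\text{VI-ULCB (model-based)}}
\]

Dividing by the lower bound, Nash V-Learning is off only by a poly(H) factor (H³), while VI-ULCB is additionally suboptimal in S and in the action dependence, carrying S² and AB in place of S and A + B, which is exactly the comparison the excerpt draws between model-free and model-based rates.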