This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions, and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\widetilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S, A, B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$, leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\widetilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\widetilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. Towards understanding learning objectives in Markov games other than finding the Nash equilibrium, we also present a computational hardness result for learning the best response against a fixed opponent; this in turn implies the computational hardness of achieving sublinear regret when playing against adversarial opponents.

A best response $\nu^\dagger(\mu)$ to a max-player policy $\mu$ satisfies $V^{\mu,\nu^\dagger(\mu)}_1(s_1) = \inf_\nu V^{\mu,\nu}_1(s_1)$ at step 1. We remark that the best response of a general policy is not necessarily Markovian.
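To make the best-response definition concrete, the following is a minimal sketch (not from the paper) for the one-step special case, a zero-sum matrix game: given the max-player's mixed strategy $\mu$ and a payoff matrix $G$ of payoffs to the max player, some pure min-player action attains the infimum, so $V^{\mu,\nu^\dagger(\mu)} = \min_b (\mu^\top G)_b$. The function name `best_response_value` and the example matrices are illustrative assumptions, not artifacts of the paper.

```python
import numpy as np

def best_response_value(G: np.ndarray, mu: np.ndarray) -> tuple[int, float]:
    """Value of the min player's best response to a fixed max-player strategy.

    G:  (A, B) payoff matrix; G[a, b] is the payoff to the max player.
    mu: length-A mixed strategy of the max player (nonnegative, sums to 1).

    Returns the minimizing column b* and the value min_b (mu^T G)_b; in a
    matrix game a pure action always attains the infimum over mixed nu.
    """
    expected = mu @ G          # expected payoff for each min-player action
    b_star = int(np.argmin(expected))
    return b_star, float(expected[b_star])

# Example: matching pennies. Against the uniform strategy every min-player
# response yields value 0, which is also the Nash value of this game; a
# biased strategy is exploitable by the best response.
G = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
print(best_response_value(G, np.array([0.5, 0.5])))  # (0, 0.0)
print(best_response_value(G, np.array([0.8, 0.2])))  # (1, -0.6): exploitable
```

In the episodic Markov game this computation would be applied backwards over the $H$ steps rather than to a single matrix, and, as remarked above, the best response to a non-Markov policy need not itself be Markovian.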