2022
DOI: 10.48550/arxiv.2202.01752
Preprint

Near-Optimal Learning of Extensive-Form Games with Imperfect Information

Abstract: This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only O((XA + YB)/ε²) episodes of play to find an ε-approximate Nash equilibrium in two-player zero-sum games, where X, Y are the number of information sets and A, B are the number of actions for the two players. This improves upon the best known sample complexity of O((X²A + Y²B)/ε²) by a factor of O…
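For a sense of scale, the gap between the two bounds in the abstract can be seen by plugging in hypothetical game sizes (our numbers, purely illustrative and not from the paper):

```latex
% Illustrative comparison of the two sample-complexity bounds quoted in the abstract.
% The game sizes below are hypothetical and chosen only to show the scale of the gap.
\[
  \text{prior bound: } O\!\left(\frac{X^2 A + Y^2 B}{\varepsilon^2}\right),
  \qquad
  \text{this paper: } O\!\left(\frac{XA + YB}{\varepsilon^2}\right).
\]
\[
  \text{E.g., } X = Y = 10^3,\; A = B = 10,\; \varepsilon = 0.1:
  \quad
  \frac{X^2 A + Y^2 B}{\varepsilon^2} = 2\times 10^{9},
  \qquad
  \frac{XA + YB}{\varepsilon^2} = 2\times 10^{6}.
\]
```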

Cited by 6 publications (34 citation statements). References 19 publications.
“…In particular, we provide a way to combine any gradient estimator (unbiased or biased), any exploration strategy, any interactive strategy, with any full-feedback regret minimizer to assemble a bandit regret minimization method. • We demonstrate that the most recent bandit regret minimization methods, i.e., MCCFR [Lanctot et al., 2009, Farina et al., 2020b, Farina and Sandholm, 2021], IXOMD [Kozuno et al., 2021] and balanced OMD/CFR [Bai et al., 2022], can be analyzed as a special case of our framework. We first present the theoretical bounds for biased gradient estimation bandit regret minimization methods in IIEGs.…”
Section: Introduction (mentioning)
Confidence: 89%
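The quoted template (pick an exploration/sampling scheme, estimate the loss gradient from bandit feedback, feed the estimate to a full-feedback regret minimizer) can be illustrated with a minimal single-decision-point sketch. This is our own illustration under simplifying assumptions, not the tree-form algorithms of the cited works; the names Hedge, bandit_round, and loss_fn are hypothetical.

```python
import numpy as np

class Hedge:
    """Full-feedback regret minimizer over n actions (multiplicative weights)."""
    def __init__(self, n, eta=0.1):
        self.weights = np.ones(n)
        self.eta = eta

    def strategy(self):
        return self.weights / self.weights.sum()

    def observe(self, loss_vector):
        # Standard Hedge update; expects a loss value for *every* action.
        self.weights *= np.exp(-self.eta * loss_vector)


def bandit_round(minimizer, loss_fn, rng, gamma=0.05):
    """One round of the estimate-then-feed template: sample an action,
    observe only its loss, build an importance-weighted estimate of the
    full loss vector, and hand that estimate to the full-feedback minimizer."""
    p = minimizer.strategy()
    n = len(p)
    q = (1.0 - gamma) * p + gamma / n      # simple uniform-exploration mixing
    a = rng.choice(n, p=q)
    loss_a = loss_fn(a)                    # bandit feedback: one scalar loss
    estimate = np.zeros(n)
    estimate[a] = loss_a / q[a]            # unbiased importance-weighted estimate
    minimizer.observe(estimate)
    return a, loss_a
```

With losses in [0, 1] the estimate is bounded by n/γ, which is why some exploration mixing (or an IX-style correction, as in the next quote) usually accompanies this kind of estimator.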
“…So in this setting, before using the full-feedback regret minimizer, it is necessary to estimate the loss gradient ℓ^t by v(z^t). There are two ways: the unbiased estimator [Lanctot et al., 2009, Zhou et al., 2019, Farina et al., 2020b, 2021b, Farina and Sandholm, 2021] and the biased estimator [Kozuno et al., 2021, Bai et al., 2022]. The former enables the expectation of the output estimated gradient…”
Section: Equilibrium Finding with Regret Minimization (mentioning)
Confidence: 99%
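The unbiased-versus-biased distinction drawn in this quote can be sketched in a generic bandit setting. This is our own illustration under simplifying assumptions, not the exact estimators of the cited papers; iw_estimate and ix_estimate are hypothetical names.

```python
import numpy as np

def iw_estimate(n, a, loss_a, q):
    """Unbiased importance-weighted estimate: its expectation over the
    sampled action a ~ q equals the true loss vector."""
    est = np.zeros(n)
    est[a] = loss_a / q[a]
    return est

def ix_estimate(n, a, loss_a, q, gamma):
    """Implicit-exploration (IX) style estimate: the extra +gamma in the
    denominator makes the estimate biased (downward) but bounded by 1/gamma."""
    est = np.zeros(n)
    est[a] = loss_a / (q[a] + gamma)
    return est

# Quick Monte Carlo check of (un)biasedness on a toy problem.
rng = np.random.default_rng(0)
n, gamma = 4, 0.05
q = np.array([0.1, 0.2, 0.3, 0.4])
true_loss = np.array([0.9, 0.5, 0.2, 0.7])
sum_iw, sum_ix, T = np.zeros(n), np.zeros(n), 100_000
for _ in range(T):
    a = rng.choice(n, p=q)
    sum_iw += iw_estimate(n, a, true_loss[a], q)
    sum_ix += ix_estimate(n, a, true_loss[a], q, gamma)
print("true    :", true_loss)
print("IW mean :", sum_iw / T)   # close to true_loss (unbiased)
print("IX mean :", sum_ix / T)   # systematically below true_loss (biased)
```

The trade-off: the unbiased estimate can be as large as 1/q[a], while the biased IX estimate never exceeds 1/γ, which is the kind of boundedness that high-probability analyses typically exploit.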