Abstract: In this paper we establish efficient and uncoupled learning dynamics so that, when employed by all players in a general-sum multiplayer game, the swap regret of each player after T repetitions of the game is bounded by O(log T), improving over the prior best bound of O(log^4 T). At the same time, we guarantee optimal O(√T) swap regret in the adversarial regime as well. To obtain these results, our primary contribution is to show that when all players follow our dynamics with a time-invariant learning rate…
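Concretely, the quantity being bounded can be computed from the play history alone. Below is a minimal Python sketch (illustrative only, not the paper's dynamics) that measures the swap regret of a sequence of mixed strategies against a sequence of loss vectors, using the standard per-action decomposition of the best swap deviation.

```python
import numpy as np

def swap_regret(plays, losses):
    """Swap regret of mixed strategies `plays` (T x A) against
    loss vectors `losses` (T x A).

    The best swap function decomposes per action: for each action a,
    reroute all probability mass placed on a to the single action b
    minimizing the mass-weighted cumulative loss.
    """
    plays, losses = np.asarray(plays), np.asarray(losses)
    realized = np.sum(plays * losses)      # sum_t <p_t, l_t>
    # M[a, b] = sum_t p_t(a) * l_t(b): loss if mass on a is rerouted to b
    M = plays.T @ losses
    best_swap = np.sum(M.min(axis=1))      # optimal rerouting target per action
    return realized - best_swap
```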
“…The O(√(XAT)) trigger regret asserted in Theorem 7 improves over Theorem 6 by a factor of Π₁, and matches the information-theoretic lower bound up to poly(H) and log factors. By the online-to-batch conversion (Appendix B.2), Theorem 7 also implies an O(H^4 XA/ε^2) sample complexity for learning EFCE under bandit feedback (assuming the same game sizes for all m players).…”
Section: Results (supporting)
confidence: 65%
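To spell out the online-to-batch step referenced above (a sketch, assuming the regret bound takes the form H^2·√(XAT), which is consistent with the quoted H^4 dependence but not stated explicitly here): the average correlated strategy is an ε-approximate EFCE once the per-episode regret falls below ε,

```latex
\frac{H^{2}\sqrt{XAT}}{T} \;\le\; \varepsilon
\quad\Longleftrightarrow\quad
T \;\ge\; \frac{H^{4}\,XA}{\varepsilon^{2}},
```

which matches the O(H^4 XA/ε^2) sample complexity quoted above.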
“…Two important special cases of Φ-regret are the internal regret and swap regret in normal-form games [43, 8]. A recent line of work developed algorithms with O(polylog T) swap regret bounds in normal-form games [2, 3].…”
A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the Φ-Hedge algorithm, a generic algorithm capable of learning a large class of equilibria for NFGs. We show that Φ-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the Φ-Hedge algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp O(√(XAT)) EFCE-regret under bandit feedback in an EFG with X information sets, A actions, and T episodes. To the best of our knowledge, this is the first such rate, and it matches the information-theoretic lower bound.
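For intuition about how Φ-Hedge operates in the normal-form case, the sketch below instantiates it for the swap-deviation set of a single player (a minimal sketch assuming full loss feedback; the function name `phi_hedge_swap` and loop structure are ours, and this is not the paper's EFG algorithm): each action runs its own Hedge instance over rerouting targets, the resulting row-stochastic matrix Q_t is formed, and the player plays its stationary distribution, i.e. the fixed point p = Qᵀp.

```python
import numpy as np

def phi_hedge_swap(loss_stream, n_actions, eta=0.1):
    """Minimal Phi-Hedge for the swap-deviation set in a normal-form game.

    One Hedge instance per action a maintains weights over where a's
    probability mass should be rerouted; row-normalizing gives a
    stochastic matrix Q_t, and the iterate p_t is its stationary
    distribution, i.e. the fixed point p = Q^T p.
    """
    # S[a, b]: cumulative loss charged to instance a for rerouting to b
    S = np.zeros((n_actions, n_actions))
    history = []
    for loss in loss_stream:                # loss: length-n_actions vector
        Q = np.exp(-eta * S)
        Q /= Q.sum(axis=1, keepdims=True)   # row-stochastic deviation matrix
        # stationary distribution: left eigenvector of Q for eigenvalue 1
        vals, vecs = np.linalg.eig(Q.T)
        p = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
        p = np.abs(p) / np.abs(p).sum()
        history.append(p)
        # instance a is charged the loss weighted by the mass p[a] it routed
        S += np.outer(p, loss)
    return history
```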
“…Such last-iterate convergence results for adaptive methods are relatively rare in the literature, and most of them assume perfect oracle feedback. To the best of our knowledge, the closest antecedents to our result are [2, 39], but both works make the more stringent cocoercivity assumption and consider an adaptive learning rate that is the same for all players. In particular, their learning rates are computed with global feedback and are thus less suitable for the learning-in-games setup.…”
Section: Compared To Theorem 5 We Can Now Only Bound (mentioning)
confidence: 93%
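To illustrate the local-vs-global feedback distinction drawn in this quote, here is a hypothetical per-player schedule (an AdaGrad-style sketch of ours, not the rule from [2, 39]): each player scales its step size using only the gradients it has itself observed.

```python
import numpy as np

class LocalAdaptiveRate:
    """Per-player adaptive step size computed from local feedback only.

    An AdaGrad-style schedule, eta_t = eta0 / sqrt(1 + sum of squared
    norms of the player's own observed gradients). Contrast with a
    globally computed rate, which would require every player's feedback.
    """
    def __init__(self, eta0=1.0):
        self.eta0 = eta0
        self.sq_sum = 0.0

    def step_size(self, grad):
        self.sq_sum += float(np.dot(grad, grad))
        return self.eta0 / np.sqrt(1.0 + self.sq_sum)
```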
“…This is made possible thanks to a clear distinction between additive and multiplicative noise; the latter has previously been explored in the game-theoretic context only by [4, 39] for the class of cocoercive games. Relaxing the cocoercivity assumption is a nontrivial challenge, as evidenced by the small number of works that establish last-iterate convergence results for stochastic algorithms in monotone games. Except for [27], mentioned above, this was achieved either through mini-batching [10, 30], Tikhonov regularization / Halpern iteration [37], or both [11].…”
Section: B Further Related Work (mentioning)
confidence: 99%
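For reference, the gap between the two assumption classes discussed in this quote can be stated precisely. Writing V for the game's gradient operator, the standard definitions are

```latex
\text{monotone:}\quad \langle V(x)-V(y),\,x-y\rangle \;\ge\; 0,
\qquad
\beta\text{-cocoercive:}\quad \langle V(x)-V(y),\,x-y\rangle \;\ge\; \beta\,\lVert V(x)-V(y)\rVert^{2}.
```

Every β-cocoercive operator is monotone (and (1/β)-Lipschitz), but not conversely, which is why replacing cocoercivity with plain monotonicity is the harder regime.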
“…Because of this mechanism (and the fact that players are changing their actions incrementally from one round to the next), the learners are facing a much more "predictable" sequence of events. As a result, there have been a number of research threads in the literature showing that it is possible to attain near-constant regret (i.e., at most polylogarithmic) in different classes of games, from the works of [14, 33] on finite two-player zero-sum games, to more recent works on general-sum finite games [1, 2, 16], extensive-form games [20], and even continuous games [28].…”
We examine the problem of regret minimization when the learner is involved in a continuous game with other optimizing agents: in this case, if all players follow a no-regret algorithm, it is possible to achieve significantly lower regret relative to fully adversarial environments. We study this problem in the context of variationally stable games (a class of continuous games which includes all convex-concave and monotone games), and when the players only have access to noisy estimates of their individual payoff gradients. If the noise is additive, the game-theoretic and purely adversarial settings enjoy similar regret guarantees; however, if the noise is multiplicative, we show that the learners can, in fact, achieve constant regret. We achieve this faster rate via an optimistic gradient scheme with learning rate separation: that is, the method's extrapolation and update steps are tuned to different schedules, depending on the noise profile. Subsequently, to eliminate the need for delicate hyperparameter tuning, we propose a fully adaptive method that smoothly interpolates between worst- and best-case regret guarantees.
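A minimal sketch of the learning-rate-separation idea (ours, with constant step sizes `gamma` and `eta` standing in for the paper's schedules, and a generic `grad_oracle` supplying the noisy feedback):

```python
import numpy as np

def optimistic_grad_separated(grad_oracle, x0, T, gamma=0.1, eta=0.01):
    """Optimistic gradient with separate extrapolation/update step sizes.

    Each round extrapolates with step gamma using the previous gradient
    estimate, queries the (possibly noisy) oracle at the extrapolated
    point, then updates the base iterate with a distinct step eta.
    """
    x = np.asarray(x0, dtype=float)
    g_prev = np.zeros_like(x)
    iterates = []
    for _ in range(T):
        x_lead = x - gamma * g_prev     # extrapolation step (rate gamma)
        g = grad_oracle(x_lead)         # noisy feedback at the leading point
        x = x - eta * g                 # update step (separate rate eta)
        g_prev = g
        iterates.append(x.copy())
    return iterates
```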
An abundance of recent impossibility results establishes that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for swap regret, and thus, along the way, imply convergence to a correlated equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.
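To illustrate the weighted-regret objective the reduction produces, here is a minimal weighted multiplicative-weights sketch (illustrative only; the weights w_t are supplied externally here, whereas in the reduction described above they are determined by the policy path length, which is not modeled):

```python
import numpy as np

def weighted_hedge(weighted_losses, n_actions, eta=0.1):
    """Hedge against per-round weighted losses.

    Controls the weighted regret
        sum_t w_t * (<p_t, l_t> - l_t(a))   for every fixed action a,
    by feeding Hedge the scaled losses w_t * l_t.
    """
    S = np.zeros(n_actions)             # cumulative weighted loss
    history = []
    for w, loss in weighted_losses:     # pairs (w_t, l_t)
        p = np.exp(-eta * S)
        p /= p.sum()
        history.append(p)
        S += w * np.asarray(loss)
    return history
```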