Abstract: Regret minimization is a powerful tool for solving large-scale extensive-form games. State-of-the-art methods rely on minimizing regret locally at each decision point. In this work we derive a new framework for regret minimization on sequential decision problems and extensive-form games with general compact convex sets at each decision point and general convex losses, as opposed to prior work, which has been limited to simplex decision points and linear losses. We call our framework laminar regret decomposition. It ge…
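For orientation, "minimizing regret locally at each decision point" refers to the counterfactual-regret decomposition of CFR (Zinkevich et al., 2007), which bounds the overall regret by a sum of per-infoset regrets; in standard notation (ours, not necessarily the paper's):

    R^T \le \sum_{I \in \mathcal{I}} \max\big( R^T_{\mathrm{imm}}(I),\, 0 \big),

where R^T_{\mathrm{imm}}(I) is the immediate counterfactual regret at information set I. Laminar regret decomposition generalizes this kind of sum beyond simplex decision points and linear losses.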
“…Our choice of regularizer allows an efficient update of the current policy with an O(HA) time-complexity per episode (see Section 3.3). In particular, our result answers the open problem raised by Farina et al. (2021b) and Farina and Sandholm (2021) of providing an algorithm with a high-probability regret bound scaling with √T and O(HA) computations per episode. Interestingly, we can also update the average profile (which will be returned at the end of learning, see Section 3.3) in an online fashion.…”
Section: Bandit Feedback Model-based (supporting)
confidence: 72%
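The online update of the average profile mentioned in the snippet above can be realized with a standard incremental average; a minimal sketch in our notation (the paper's actual scheme may additionally weight by reach probabilities):

    \bar{\pi}_t = \bar{\pi}_{t-1} + \tfrac{1}{t}\,(\pi_t - \bar{\pi}_{t-1}),

so that \bar{\pi}_T equals the uniform average of \pi_1, …, \pi_T without storing past profiles.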
“…Note that this bound is a consequence of a bound on the regret of both players (see Section 2) that holds even in the non-stochastic setting where an adversary picks a new game at each episode. Closer to our approach, Farina et al. (2021b) recast the setting as adversarial bandit linear optimization (Flaxman et al., 2005; Abernethy et al., 2008; see also Section 3.1). Precisely, they use the online mirror descent (OMD) algorithm with the dilated entropy distance-generating function (Hoda et al., 2010; Kroer et al., 2015) as regularizer.…”
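For reference, the dilated entropy distance-generating function cited in the snippet has, up to the choice of per-infoset weights \beta_j, the form (sequence-form variables x, with p_j denoting the parent sequence of information set j):

    \psi(x) = \sum_{j \in \mathcal{J}} \beta_j \sum_{a \in A_j} x_{ja} \log \frac{x_{ja}}{x_{p_j}},

and each OMD iterate solves x_{t+1} = \arg\min_{x \in \mathcal{X}} \eta \langle \hat{\ell}_t, x \rangle + D_\psi(x, x_t), where D_\psi is the Bregman divergence induced by \psi.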
We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIGs under the perfect-recall assumption, where the only feedback is realizations of the game (bandit feedback). In particular, the dynamics of the IIG are not known; we can only access them by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order 1/√T, where T is the number of played games. Moreover, IXOMD is computationally efficient, as it needs to perform updates only along the sampled trajectory.
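At the core of the "implicit exploration" in IXOMD is the IX loss estimator of Neu (2015), which biases the usual importance-weighted estimate to keep it bounded. A minimal Python sketch for a single decision point (the function name, flat-list representation, and per-step granularity are our assumptions, not the paper's code):

    def ix_loss_estimate(loss, chosen, probs, gamma):
        # Implicit-exploration (IX) estimator: divide the observed loss by the
        # playing probability plus a bias term gamma. The bias keeps the
        # estimate bounded, which is what enables high-probability bounds.
        est = [0.0] * len(probs)
        est[chosen] = loss / (probs[chosen] + gamma)  # only the sampled action changes
        return est

Because only the sampled action's estimate is nonzero, feeding such estimates to OMD lets updates touch only the sampled trajectory, consistent with the O(HA) per-episode cost quoted above.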
“…Although CFR theory calls for both players to simultaneously update their regrets on each iteration, in practice far better performance is achieved by alternating which player updates their regrets on each iteration. However, this complicates the theory for convergence (Farina, Kroer, and Sandholm; Burch, Moravcik, and Schmid 2018). CFR+ is like CFR but with the following small changes.…”
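For context, regret matching+, the regret-minimizer change in CFR+, floors cumulative regrets at zero after each update. A minimal Python sketch for one information set (our own illustration, not the authors' code):

    def regret_matching_plus(cum_regret, instant_regret):
        # Floor cumulative regrets at zero so that accumulated negative regret
        # cannot suppress an action that later becomes good.
        cum_regret = [max(r + d, 0.0) for r, d in zip(cum_regret, instant_regret)]
        total = sum(cum_regret)
        n = len(cum_regret)
        # Play proportionally to positive regret; fall back to uniform.
        strategy = [r / total for r in cum_regret] if total > 0 else [1.0 / n] * n
        return cum_regret, strategy

CFR+ additionally uses alternating updates (as the snippet notes) and linearly weighted strategy averaging.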
Counterfactual regret minimization (CFR) is a family of iterative algorithms that are the most popular and, in practice, fastest approach to approximately solving large imperfect-information games. In this paper we introduce novel CFR variants that 1) discount regrets from earlier iterations in various ways (in some cases differently for positive and negative regrets), 2) reweight iterations in various ways to obtain the output strategies, 3) use a non-standard regret minimizer and/or 4) leverage "optimistic regret matching". They lead to dramatically improved performance in many settings. For one, we introduce a variant that outperforms CFR+, the prior state-of-the-art algorithm, in every game tested, including large-scale realistic settings. CFR+ is a formidable benchmark: no other algorithm has been able to outperform it. Finally, we show that, unlike CFR+, many of the important new variants are compatible with modern imperfect-information-game pruning techniques, and one is also compatible with sampling in the game tree.
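The discounting in point 1) can be made concrete: in the DCFR family with parameters \alpha, \beta, \gamma, after iteration t positive cumulative regrets are multiplied by t^\alpha/(t^\alpha+1), negative ones by t^\beta/(t^\beta+1), and iteration t's contribution to the average strategy is weighted by (t/(t+1))^\gamma. A sketch of the regret-discount step in Python (our illustration; parameter defaults follow the commonly recommended DCFR setting):

    def dcfr_discount(cum_regret, t, alpha=1.5, beta=0.0):
        # Positive regrets decay by t^alpha / (t^alpha + 1), negative regrets
        # by t^beta / (t^beta + 1); with beta = 0 negative regrets are halved
        # each iteration, so old mistakes are forgotten quickly.
        pos = t**alpha / (t**alpha + 1.0)
        neg = t**beta / (t**beta + 1.0)
        return [r * pos if r > 0 else r * neg for r in cum_regret]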
“…Sequential decision making [1] has received much attention in recent years due to the rapid increase in the generation of streaming data. One representative application field of decision-making problems is recommendation systems [2], where an agent has to make a decision from several choices, e.g., assigning users appropriate advertisements [3], articles [4], movies [5], etc. [6].…”
Online context-based domains such as recommendation systems strive to promptly suggest appropriate items to users according to information about items and users. However, such contextual information may not be available in practice, where the only information we can utilize is users' interaction data. Furthermore, the lack of clicked records, especially for new users, worsens the performance of the system. To address these issues, similarity measuring, one of the key techniques in collaborative filtering, is combined with the online context-based multi-armed bandit mechanism. The similarity between the context of a selected item and any candidate item is calculated and weighted. An adaptive method for adjusting the weights according to the time elapsed since clicking is proposed. The weighted similarity is then multiplied with the action value to decide which action is optimal or the poorest. Additionally, we introduce an exploration probability equation based on the number of times the poorest action has been selected and the variance of the action values, to balance exploration and exploitation. A regret analysis is given and an upper bound on the regret is proved. Empirical studies on three benchmarks, a random dataset, Yahoo!R6A, and MovieLens, demonstrate the effectiveness of the proposed method.
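A hedged sketch of the scoring rule the abstract describes, reading "weighted similarity multiplied with the action value" literally (the exponential decay and all names are our assumptions; the paper proposes an adaptive weighting rule):

    import math

    def score_actions(q_values, similarities, elapsed, decay=0.1):
        # Down-weight the similarity by a factor that shrinks with the time
        # elapsed since the reference click, then multiply into each arm's
        # action value to rank candidates.
        w = math.exp(-decay * elapsed)
        return [q * w * s for q, s in zip(q_values, similarities)]

The exploration probability, per the abstract, would additionally depend on how often the poorest-scoring arm has been selected and on the variance of the action values.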