Abstract: Regret minimization is a powerful tool for solving large-scale extensive-form games. State-of-the-art methods rely on minimizing regret locally at each decision point. In this work we derive a new framework for regret minimization on sequential decision problems and extensive-form games with general compact convex sets at each decision point and general convex losses, as opposed to prior work, which has been limited to simplex decision points and linear losses. We call our framework laminar regret decomposition. It ge…
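For orientation, "minimizing regret locally at each decision point" refers to the counterfactual-regret decomposition of CFR (Zinkevich et al., 2007), which bounds the overall regret by a sum of per-infoset regrets; in standard notation (ours, not necessarily the paper's):

    R^T \le \sum_{I \in \mathcal{I}} \max\big( R^T_{\mathrm{imm}}(I),\, 0 \big),

where R^T_{\mathrm{imm}}(I) is the immediate counterfactual regret at information set I. Laminar regret decomposition generalizes this kind of sum beyond simplex decision points and linear losses.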
“…Our choice of regularizer allows an efficient update of the current policy with an O(HA) time-complexity per episode (see Section 3.3). In particular, our result answers the open problem raised by Farina et al. (2021b) and Farina and Sandholm (2021) of providing an algorithm with a high-probability regret bound scaling with √T and O(HA) computations per episode. Interestingly, we can also update the average profile (which will be returned at the end of learning, see Section 3.3) in an online fashion.…”
Section: Bandit Feedback Model-based (supporting)
confidence: 72%
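The online update of the average profile mentioned in the snippet above can be realized with a standard incremental average; a minimal sketch in our notation (the paper's actual scheme may additionally weight by reach probabilities):

    \bar{\pi}_t = \bar{\pi}_{t-1} + \tfrac{1}{t}\,(\pi_t - \bar{\pi}_{t-1}),

so that \bar{\pi}_T equals the uniform average of \pi_1, …, \pi_T without storing past profiles.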
“…Note that this bound is a consequence of a bound on the regret of both players (see Section 2) that holds even in the non-stochastic setting where an adversary picks a new game at each episode. Closer to our approach, Farina et al. (2021b) recast the setting as adversarial bandit linear optimization (Flaxman et al., 2005; Abernethy et al., 2008; see also Section 3.1). Precisely, they use the online mirror descent (OMD) algorithm with the dilated entropy distance-generating function (Hoda et al., 2010; Kroer et al., 2015) as regularizer.…”
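For reference, the dilated entropy distance-generating function cited in the snippet has, up to the choice of per-infoset weights \beta_j, the form (sequence-form variables x, with p_j denoting the parent sequence of information set j):

    \psi(x) = \sum_{j \in \mathcal{J}} \beta_j \sum_{a \in A_j} x_{ja} \log \frac{x_{ja}}{x_{p_j}},

and each OMD iterate solves x_{t+1} = \arg\min_{x \in \mathcal{X}} \eta \langle \hat{\ell}_t, x \rangle + D_\psi(x, x_t), where D_\psi is the Bregman divergence induced by \psi.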
We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIGs under the perfect-recall assumption, where the only feedback is realizations of the game (bandit feedback). In particular, the dynamics of the IIG are not known; we can only access them by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order 1/√T, where T is the number of played games. Moreover, IXOMD is computationally efficient, as it needs to perform updates only along the sampled trajectory.
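At the core of the "implicit exploration" in IXOMD is the IX loss estimator of Neu (2015), which biases the usual importance-weighted estimate to keep it bounded. A minimal Python sketch for a single decision point (the function name, flat-list representation, and per-step granularity are our assumptions, not the paper's code):

    def ix_loss_estimate(loss, chosen, probs, gamma):
        # Implicit-exploration (IX) estimator: divide the observed loss by the
        # playing probability plus a bias term gamma. The bias keeps the
        # estimate bounded, which is what enables high-probability bounds.
        est = [0.0] * len(probs)
        est[chosen] = loss / (probs[chosen] + gamma)  # only the sampled action changes
        return est

Because only the sampled action's estimate is nonzero, feeding such estimates to OMD lets updates touch only the sampled trajectory, consistent with the O(HA) per-episode cost quoted above.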
“…Although CFR theory calls for both players to simultaneously update their regrets on each iteration, in practice far better performance is achieved by alternating which player updates their regrets on each iteration. However, this complicates the theory for convergence (Farina, Kroer, and Sandholm; Burch, Moravcik, and Schmid 2018). CFR+ is like CFR but with the following small changes.…”
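For context, regret matching+, the regret-minimizer change in CFR+, floors cumulative regrets at zero after each update. A minimal Python sketch for one information set (our own illustration, not the authors' code):

    def regret_matching_plus(cum_regret, instant_regret):
        # Floor cumulative regrets at zero so that accumulated negative regret
        # cannot suppress an action that later becomes good.
        cum_regret = [max(r + d, 0.0) for r, d in zip(cum_regret, instant_regret)]
        total = sum(cum_regret)
        n = len(cum_regret)
        # Play proportionally to positive regret; fall back to uniform.
        strategy = [r / total for r in cum_regret] if total > 0 else [1.0 / n] * n
        return cum_regret, strategy

CFR+ additionally uses alternating updates (as the snippet notes) and linearly weighted strategy averaging.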
Counterfactual regret minimization (CFR) is a family of iterative algorithms that are the most popular and, in practice, fastest approach to approximately solving large imperfect-information games. In this paper we introduce novel CFR variants that 1) discount regrets from earlier iterations in various ways (in some cases differently for positive and negative regrets), 2) reweight iterations in various ways to obtain the output strategies, 3) use a non-standard regret minimizer and/or 4) leverage "optimistic regret matching". They lead to dramatically improved performance in many settings. For one, we introduce a variant that outperforms CFR+, the prior state-of-the-art algorithm, in every game tested, including large-scale realistic settings. CFR+ is a formidable benchmark: no other algorithm has been able to outperform it. Finally, we show that, unlike CFR+, many of the important new variants are compatible with modern imperfect-information-game pruning techniques, and one is also compatible with sampling in the game tree.
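The discounting in point 1) can be made concrete: in the DCFR family with parameters \alpha, \beta, \gamma, after iteration t positive cumulative regrets are multiplied by t^\alpha/(t^\alpha+1), negative ones by t^\beta/(t^\beta+1), and iteration t's contribution to the average strategy is weighted by (t/(t+1))^\gamma. A sketch of the regret-discount step in Python (our illustration; parameter defaults follow the commonly recommended DCFR setting):

    def dcfr_discount(cum_regret, t, alpha=1.5, beta=0.0):
        # Positive regrets decay by t^alpha / (t^alpha + 1), negative regrets
        # by t^beta / (t^beta + 1); with beta = 0 negative regrets are halved
        # each iteration, so old mistakes are forgotten quickly.
        pos = t**alpha / (t**alpha + 1.0)
        neg = t**beta / (t**beta + 1.0)
        return [r * pos if r > 0 else r * neg for r in cum_regret]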
“…Sequential decision making [1] has received much attention in recent years due to the rapid increase in the generation of streaming data. One representative application field of decision-making problems is recommendation systems [2], where an agent has to make a decision from several choices, e.g., assigning users appropriate advertisements [3], articles [4], movies [5], etc. [6].…”
Online context-based domains such as recommendation systems strive to promptly suggest appropriate items to users according to information about items and users. However, such contextual information may not be available in practice, where the only information we can utilize is users' interaction data. Furthermore, the lack of clicked records, especially for new users, worsens the performance of the system. To address these issues, similarity measuring, one of the key techniques in collaborative filtering, is combined with the online context-based multi-armed bandit mechanism. The similarity between the context of a selected item and any candidate item is calculated and weighted. An adaptive method for adjusting the weights according to the time elapsed since clicking is proposed. The weighted similarity is then multiplied with the action value to decide which action is optimal or the poorest. Additionally, we introduce an exploration probability equation based on the number of times the poorest action has been selected and the variance of the action values, to balance exploration and exploitation. A regret analysis is given and an upper bound on the regret is proved. Empirical studies on three benchmarks, a random dataset, Yahoo!R6A, and MovieLens, demonstrate the effectiveness of the proposed method.
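A hedged sketch of the scoring rule the abstract describes, reading "weighted similarity multiplied with the action value" literally (the exponential decay and all names are our assumptions; the paper proposes an adaptive weighting rule):

    import math

    def score_actions(q_values, similarities, elapsed, decay=0.1):
        # Down-weight the similarity by a factor that shrinks with the time
        # elapsed since the reference click, then multiply into each arm's
        # action value to rank candidates.
        w = math.exp(-decay * elapsed)
        return [q * w * s for q, s in zip(q_values, similarities)]

The exploration probability, per the abstract, would additionally depend on how often the poorest-scoring arm has been selected and on the variance of the action values.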