2019
DOI: 10.1609/aaai.v33i01.33011917

Online Convex Optimization for Sequential Decision Processes and Extensive-Form Games

Abstract: Regret minimization is a powerful tool for solving large-scale extensive-form games. State-of-the-art methods rely on minimizing regret locally at each decision point. In this work we derive a new framework for regret minimization on sequential decision problems and extensive-form games with general compact convex sets at each decision point and general convex losses, as opposed to prior work which has been for simplex decision points and linear losses. We call our framework laminar regret decomposition. It ge…
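For context, a standard online convex optimization definition of the regret such frameworks minimize (this is the textbook notation, not notation taken from the paper itself): over T rounds the learner picks x_t from a compact convex set X, then observes a convex loss f_t, and its regret is

R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in X} \sum_{t=1}^{T} f_t(x).

An algorithm is a regret minimizer when R_T grows sublinearly in T (e.g., O(\sqrt{T})), in which case the time-averaged decisions do as well, asymptotically, as the best fixed decision in hindsight.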

Cited by 34 publications (38 citation statements). References 19 publications (30 reference statements).
“…Our choice of regularizer allows an efficient update of the current policy with an O(HA) time-complexity per episode (see Section 3.3). In particular, our result answers the open problem raised by Farina et al (2021b) and Farina and Sandholm (2021) of providing an algorithm with a high-probability regret bound scaling with √T with O(HA) computations per episode. Interestingly, we can also update the average profile (which will be returned at the end of learning, see Section 3.3) in an online fashion.…”
Section: Bandit Feedback Model-based (supporting)
confidence: 72%
“…Note that this bound is a consequence of a bound on the regret of both players (see Section 2) that holds even in the non-stochastic setting where an adversary picks a new game at each episode. Closer to our approach, Farina et al (2021b) recast the setting as adversarial bandit linear optimization (Flaxman et al, 2005; Abernethy et al, 2008; see also Section 3.1). Precisely, they use the online mirror descent (OMD) algorithm with the dilated entropy distance-generating function (Hoda et al, 2010; Kroer et al, 2015) as the regularizer.…”
Section: Bandit Feedback Model-based (mentioning)
confidence: 99%
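As a rough illustration of the OMD-with-entropy-regularizer idea mentioned in the excerpt above, the sketch below shows the simplest case of a single probability simplex; the dilated entropy used for the full sequence-form polytope is more involved, and all names here are illustrative assumptions rather than the cited algorithm.

import numpy as np

def omd_entropy_step(x, loss_grad, eta):
    """One online mirror descent step with the negative-entropy regularizer
    on the probability simplex (equivalently, exponentiated gradient):
    the new point is proportional to x * exp(-eta * loss_grad)."""
    logits = np.log(x) - eta * loss_grad
    logits -= logits.max()              # shift for numerical stability
    x_new = np.exp(logits)
    return x_new / x_new.sum()

# Toy usage: adversarial linear losses on a 3-action simplex.
rng = np.random.default_rng(0)
x = np.full(3, 1.0 / 3)                 # start from the uniform strategy
eta = 0.1                               # step size (typically tuned ~ 1/sqrt(T))
for t in range(1000):
    g = rng.uniform(0.0, 1.0, size=3)   # loss vector revealed after playing x
    x = omd_entropy_step(x, g, eta)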
“…Although CFR theory calls for both players to update their regrets simultaneously on each iteration, in practice far better performance is achieved by alternating which player updates their regrets on each iteration. However, this complicates the convergence theory (Farina, Kroer, and Sandholm; Burch, Moravcik, and Schmid 2018). CFR+ is like CFR but with the following small changes.…”
Section: Notation and Background (mentioning)
confidence: 99%
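The excerpt above contrasts simultaneous and alternating updates in CFR/CFR+ but is cut off before listing the CFR+ changes; for orientation only, here is a minimal sketch of regret matching+, the clipped-regret local update commonly associated with CFR+, written for a single decision point with all names assumed for illustration.

import numpy as np

def rm_plus_strategy(regrets):
    """Play actions in proportion to clipped cumulative regrets;
    fall back to uniform when all regrets are zero."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

def rm_plus_update(regrets, action_utils, strategy):
    """Add instantaneous regrets and clip at zero, so negative accumulated
    regret is forgotten immediately (the '+' in regret matching+)."""
    expected = action_utils @ strategy
    return np.maximum(regrets + (action_utils - expected), 0.0)

# Toy usage on a single 3-action decision point with random utilities.
rng = np.random.default_rng(0)
regrets = np.zeros(3)
for t in range(100):
    strategy = rm_plus_strategy(regrets)
    utils = rng.uniform(0.0, 1.0, size=3)   # per-action utilities this iteration
    regrets = rm_plus_update(regrets, utils, strategy)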
“…Sequential decision making [1] has received much attention in recent years due to the rapid growth of streaming data. One representative application field of decision-making problems is recommendation systems [2], where an agent has to make a decision from several choices, e.g., assigning users appropriate advertisements [3], articles [4], movies [5], etc. [6].…”
Section: Introduction (mentioning)
confidence: 99%