2021
DOI: 10.48550/arxiv.2109.03396
Preprint

Learning Zero-sum Stochastic Games with Posterior Sampling

Mehdi Jafarnia-Jahromi,
Rahul Jain,
Ashutosh Nayyar

Abstract: In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of O(HS√(AT)) in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here H is an upper bound on the span of the bias function, S is the number of states, A is the number of joint actions, and T is the horizon. We consider the online setting where the opponent cannot be controlled and can take any arbitrary…
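
For intuition, here is a minimal sketch of the posterior-sampling loop the abstract describes, assuming a small tabular game with a known reward function: a Dirichlet posterior is kept over the transition kernel, a model is sampled from the posterior at the start of each episode, a maximin policy is computed for the sampled game, and the posterior is updated with the observed transitions. The fixed episode length, discounted value iteration, and pure-action maximin below are simplifications for illustration only; PSRL-ZSG itself uses average-reward planning and a data-dependent episode schedule, and all function and variable names here are hypothetical, not the paper's API.

```python
import numpy as np

# Illustrative sketch only: tabular zero-sum stochastic game with a known
# reward and an unknown transition kernel learned via a Dirichlet posterior.
S, A1, A2 = 5, 3, 3                       # states, agent actions, opponent actions
rng = np.random.default_rng(0)
reward = rng.uniform(size=(S, A1, A2))                 # assumed known here
true_P = rng.dirichlet(np.ones(S), size=(S, A1, A2))   # unknown to the learner

counts = np.ones((S, A1, A2, S))          # Dirichlet counts defining the posterior

def sample_model(counts):
    """Draw one transition kernel from the Dirichlet posterior."""
    flat = counts.reshape(-1, S)
    return np.array([rng.dirichlet(c) for c in flat]).reshape(counts.shape)

def maximin_policy(P, gamma=0.95, iters=300):
    """Discounted value iteration on the sampled game; the stage game is
    approximated by a max over the agent's pure actions of the min over the
    opponent's pure actions (solving the matrix game exactly would need an LP)."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = reward + gamma * (P @ V)      # shape (S, A1, A2)
        V = Q.min(axis=2).max(axis=1)
    Q = reward + gamma * (P @ V)
    return Q.min(axis=2).argmax(axis=1)   # one maximin action per state

state, episode_len, n_episodes = 0, 50, 40
for _ in range(n_episodes):
    P_hat = sample_model(counts)          # posterior sampling step
    pi = maximin_policy(P_hat)            # plan against the sampled game
    for _ in range(episode_len):
        a1 = pi[state]
        a2 = rng.integers(A2)             # opponent is arbitrary / uncontrolled
        nxt = rng.choice(S, p=true_P[state, a1, a2])
        counts[state, a1, a2, nxt] += 1   # Bayesian update of the posterior
        state = nxt
```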

Cited by 2 publications (6 citation statements)
References 12 publications

“…This result is similar to the bound obtained in [24] for the regret in infinite-horizon MDP. In addition, our algorithm enjoys a low computational complexity and low memory space requirement compared to the previous works of [23] and [9] in the same setting. We also provide a high-probability bound for the regret of our algorithm.…”
Section: Our Contribution
confidence: 91%
“…This definition is classical for the regret in decentralized learning [9,21,23], and generalizes the definition for the MDP setting [1,24]. Note that in contrast with MDPs, the regret is not necessarily non-negative: If the opponent is weak, then the learning agent can achieve an average-reward greater than * .…”
Section: Learning Objective
confidence: 98%
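
For context on the regret notion these statements refer to: assuming ρ* denotes the maximin average reward of the game (an assumption here; the corresponding symbol in the excerpt above did not survive extraction), the average-reward regret over T steps is typically of the form, up to details that vary between the cited works,

$$ R_T \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t, $$

where r_t is the reward collected at time t. Because the opponent is not controlled, a weak opponent may let the learner collect more than ρ* per step on average, which is why R_T can be negative.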