2019
DOI: 10.48550/arxiv.1901.08029
Preprint

Learning to Collaborate in Markov Decision Processes

Abstract: We consider a two-agent MDP framework where agents repeatedly solve a task in a collaborative setting. We study the problem of designing a learning algorithm for the first agent (A1) that facilitates successful collaboration even in cases when the second agent (A2) is adapting its policy in an unknown way. The key challenge in our setting is that the first agent faces non-stationarity in rewards and transitions because of the adaptive behavior of the second agent. We design novel online learning algorithms …
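As a rough illustration of this setting (a minimal sketch under assumed interfaces, not the authors' model or algorithm), the snippet below shows why A1 faces non-stationarity: once A2's adapting policy is folded into the environment, the same state-action pair for A1 can yield different rewards and transitions over time. The toy task and A2's adaptation rule are placeholders.

```python
# Minimal sketch of the two-agent collaborative setting from the abstract.
# The toy task and A2's adaptation rule are illustrative assumptions only.

NUM_STATES, NUM_ACTIONS = 3, 2

def a2_policy(state, episode):
    """A2 adapts between episodes in a way A1 does not observe directly."""
    return (state + episode // 10) % NUM_ACTIONS

def effective_step_for_a1(state, a1_action, episode):
    """From A1's point of view, A2's action is folded into the environment,
    so the induced reward and transition drift as A2 adapts."""
    a2_action = a2_policy(state, episode)
    reward = 1.0 if a1_action == a2_action else 0.0  # collaboration succeeds on agreement
    next_state = (state + a1_action + a2_action) % NUM_STATES
    return next_state, reward

if __name__ == "__main__":
    # The same (state, action) pair for A1 yields different outcomes over time.
    for episode in (0, 10, 20):
        print(episode, effective_step_for_a1(state=0, a1_action=1, episode=episode))
```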

Cited by 2 publications (4 citation statements)
References 15 publications (57 reference statements)
“…Assumption 1 states that the visitation measures do not change drastically when similar policies are executed. This notion of smoothness in visitation measures also appears in [41] in the context of two-player games.…”
Section: Model Assumptions
Citation type: mentioning
Confidence: 86%
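For concreteness, assumptions of this type are usually stated as a Lipschitz-type bound relating the change in the state visitation measure to the change in the policy. The form below is a generic illustration; the constant C, the choice of norms, and the notation d^{\pi} are placeholders rather than the exact condition used in either paper.

```latex
% Illustrative Lipschitz-type smoothness condition on state visitation measures
% (generic form; C, the norms, and d^{\pi} are placeholders, not the exact
%  assumption stated in either paper).
\[
  \bigl\| d^{\pi} - d^{\pi'} \bigr\|_{1}
  \;\le\;
  C \, \max_{s} \bigl\| \pi(\cdot \mid s) - \pi'(\cdot \mid s) \bigr\|_{1}
\]
```

Here $d^{\pi}$ denotes the state visitation distribution induced by policy $\pi$, so nearby policies induce nearby visitation measures.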
“…In addition to updating and evaluating policy, Algorithm 1 features a periodic restart mechanism, which resets its policy estimate every τ episodes. Restart mechanisms have been used to handle non-stationarity in RL [27, 39] and related problems including bandits [6], online convex optimization [7, 26] and games [17, 41]. Intuitively, by employing the restart mechanism, Algorithm 1 is able to stabilize its iterates against non-stationary drift in the learning process due to adversarial reward functions.…”
Section: Power
Citation type: mentioning
Confidence: 99%
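To make the restart mechanism concrete, here is a minimal sketch of a learning loop that resets its estimate every τ episodes; the tabular Q-update and the toy non-stationary reward are illustrative assumptions, not the Algorithm 1 of the citing paper.

```python
import random

# Minimal sketch of a periodic-restart learning loop (illustrative only;
# the tabular update and the toy drifting reward are placeholders).

def run_with_restarts(num_episodes=200, tau=50, num_states=4, num_actions=2,
                      alpha=0.1, gamma=0.9):
    q = [[0.0] * num_actions for _ in range(num_states)]  # policy estimate (Q-table)
    for episode in range(num_episodes):
        if episode % tau == 0:
            # Periodic restart: discard the accumulated estimate every tau episodes
            # so that stale information from the drifting environment does not
            # dominate the iterates.
            q = [[0.0] * num_actions for _ in range(num_states)]
        s = random.randrange(num_states)
        for _ in range(10):  # short episode
            a = max(range(num_actions), key=lambda a_: q[s][a_])  # greedy action
            r = random.random() + 0.01 * episode  # toy non-stationary reward
            s_next = random.randrange(num_states)
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q

if __name__ == "__main__":
    print(run_with_restarts())
```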
“…Finally, our last assumption stipulates that the state visitation distributions are smooth with respect to the (embedded) mean-field states of the MFG. This assumption is analogous to those in the literature on MDP and two-player games (Fei et al., 2020; Radanovic et al., 2019), which requires the visitation distributions to be smooth with respect to the policy. Assumption 5.…”
Section: Assumption 4 (Finite Concentrability Coefficients)
Citation type: mentioning
Confidence: 99%