2020
DOI: 10.48550/arxiv.2007.01962
Preprint

Reward Machines for Cooperative Multi-Agent Reinforcement Learning

Cyrus Neary,
Zhe Xu,
Bo Wu
et al.

Abstract: In cooperative multi-agent reinforcement learning, a collection of agents learns to interact in a shared environment to achieve a common goal. We propose the use of reward machines (RMs) - Mealy machines used as structured representations of reward functions - to encode the team's task. The proposed novel interpretation of RMs in the multi-agent setting explicitly encodes required teammate interdependencies and independencies, allowing the team-level task to be decomposed into sub-tasks for individual agents. We …
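The abstract describes a reward machine as a Mealy machine whose transitions are driven by high-level events and whose outputs are rewards. A minimal sketch of that idea follows, assuming a hypothetical two-step task and hypothetical event labels; it is an illustration, not the authors' implementation.

```python
# Minimal reward-machine sketch: a Mealy machine over high-level event labels.
# The states, labels, and rewards below are hypothetical, not taken from the paper.

from typing import Dict, FrozenSet, Tuple

RMState = str
Label = FrozenSet[str]  # set of propositions observed at a time step


class RewardMachine:
    def __init__(self,
                 initial: RMState,
                 transitions: Dict[Tuple[RMState, Label], Tuple[RMState, float]]):
        self.initial = initial
        self.transitions = transitions
        self.state = initial

    def step(self, label: Label) -> float:
        """Advance the machine on the observed label and return the emitted reward."""
        # Unlisted (state, label) pairs self-loop with zero reward.
        next_state, reward = self.transitions.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward


# Hypothetical two-step team task: press a button, then open a door.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", frozenset({"button"})): ("u1", 0.0),
        ("u1", frozenset({"door"})):   ("u2", 1.0),  # task complete
    },
)
```

Feeding the labels {button} and then {door} through rm.step yields rewards 0 and then 1, so the machine emits the task reward only once the sub-goals occur in the required order.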


Cited by 4 publications (4 citation statements)
References 20 publications
“…The current work assumes each agent has an individual reward machine and that the reward machine is known, but it would be possible to learn reward machines from experience [24] and to decompose a team-level task into a collection of reward machines [21]. We use a standard actor-critic algorithm; one possible research direction is to use more advanced RL algorithms such as Proximal Policy Optimization (PPO) [39] to allow for solving continuous-action problems.…”
Section: Discussion
confidence: 99%
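The decomposition this excerpt refers to gives each agent its own reward machine, and the agent then learns over the product of the environment state and its RM state. The sketch below shows roughly where the RM state enters the learner, reusing the RewardMachine sketch above; it uses tabular Q-learning for brevity in place of the actor-critic learner the excerpt mentions, and the environment and labelling interfaces are hypothetical.

```python
# Sketch: one agent learning over the product of environment state and its own
# reward-machine state. The env API (reset/step/actions) and the labeller are
# hypothetical; tabular Q-learning stands in for the actor-critic method cited.

from collections import defaultdict
import random


def q_learning_with_rm(env, rm, labeller, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # keyed by ((env_state, rm_state), action)

    for _ in range(episodes):
        s = env.reset()
        rm.state = rm.initial
        done = False
        while not done:
            u = rm.state
            # Epsilon-greedy over the product state (s, u).
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[((s, u), act)])

            s_next, done = env.step(a)
            r = rm.step(labeller(s, a, s_next))  # reward comes from the agent's RM
            u_next = rm.state

            best_next = 0.0 if done else max(Q[((s_next, u_next), act)] for act in env.actions)
            Q[((s, u), a)] += alpha * (r + gamma * best_next - Q[((s, u), a)])
            s = s_next
    return Q
```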
“…In fact, for each finite-horizon objective φ_i, there exists a DFA A_φ with a unique accepting state whose outgoing transitions are all self-loops. These DFAs can be interpreted as reward machines [36], [37] such that the reward of every transition from a non-accepting state to the accepting state is 1, and is 0 otherwise. Note that the minimax optimal value for the undiscounted sum of rewards (total reward) over the product of this reward machine and the discretized subsystem Σ_δi is equal to the optimal probability of satisfaction of the DFA specification.…”
Section: Proof
confidence: 99%
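The construction in this excerpt, reading a DFA with an absorbing accepting state as a reward machine that pays 1 only on the transition into the accepting state, can be sketched as follows. The toy DFA and its encoding are illustrative and not taken from the cited proof.

```python
# Sketch of reading a DFA as a reward machine: reward 1 on any transition from a
# non-accepting state into the (absorbing) accepting state, 0 otherwise.

def dfa_to_reward_machine(dfa_delta, accepting_state):
    """dfa_delta: dict mapping (state, symbol) -> next_state."""
    rm_transitions = {}
    for (state, symbol), next_state in dfa_delta.items():
        reward = 1.0 if (state != accepting_state and next_state == accepting_state) else 0.0
        rm_transitions[(state, symbol)] = (next_state, reward)
    return rm_transitions


# Toy DFA: accept once an 'a' and then a 'b' have been seen; 'acc' is absorbing.
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "acc",
    ("acc", "a"): "acc", ("acc", "b"): "acc",
}
rm = dfa_to_reward_machine(delta, accepting_state="acc")
# The undiscounted total reward along any run is 1 exactly when the run reaches
# 'acc', which is what lets total reward stand in for probability of satisfaction.
```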
“…The assumption of a user-provided reward machine has been lifted in follow-up works (Gaon and Brafman 2020; Xu et al. 2020; Furelos-Blanco et al. 2020). Learning temporal representations of the reward has been explored in different contexts: for multi-agent settings (Neary et al. 2020), for reward shaping (Velasquez et al. 2021b), or with user-provided advice (Neider et al. 2021). All of these approaches are fragile in the presence of noisy rewards.…”
Section: Related Work
confidence: 99%