2020
DOI: 10.48550/arxiv.2007.01962
Preprint

Reward Machines for Cooperative Multi-Agent Reinforcement Learning

Cyrus Neary,
Zhe Xu,
Bo Wu
et al.

Abstract: In cooperative multi-agent reinforcement learning, a collection of agents learns to interact in a shared environment to achieve a common goal. We propose the use of reward machines (RMs) - Mealy machines used as structured representations of reward functions - to encode the team's task. The proposed novel interpretation of RMs in the multi-agent setting explicitly encodes required teammate interdependencies and independencies, allowing the team-level task to be decomposed into sub-tasks for individual agents. We …
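The abstract describes a reward machine as a Mealy machine whose transitions are driven by high-level events and whose outputs are rewards. A minimal sketch of that idea follows, assuming a hypothetical two-step task and hypothetical event labels; it is an illustration, not the authors' implementation.

```python
# Minimal reward-machine sketch: a Mealy machine over high-level event labels.
# The states, labels, and rewards below are hypothetical, not taken from the paper.

from typing import Dict, FrozenSet, Tuple

RMState = str
Label = FrozenSet[str]  # set of propositions observed at a time step


class RewardMachine:
    def __init__(self,
                 initial: RMState,
                 transitions: Dict[Tuple[RMState, Label], Tuple[RMState, float]]):
        self.initial = initial
        self.transitions = transitions
        self.state = initial

    def step(self, label: Label) -> float:
        """Advance the machine on the observed label and return the emitted reward."""
        # Unlisted (state, label) pairs self-loop with zero reward.
        next_state, reward = self.transitions.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward


# Hypothetical two-step team task: press a button, then open a door.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", frozenset({"button"})): ("u1", 0.0),
        ("u1", frozenset({"door"})):   ("u2", 1.0),  # task complete
    },
)
```

Feeding the labels {button} and then {door} through rm.step yields rewards 0 and then 1, so the machine emits the task reward only once the sub-goals occur in the required order.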


Cited by 4 publications (4 citation statements)
References 20 publications
“…The current work assumes each agent has an individual reward machine and that the reward machine is known, but it would be possible to learn reward machines from experience [24] and to decompose a team-level task into a collection of reward machines [21]. We use a standard actor-critic algorithm; one possible research direction is to use more advanced RL algorithms such as Proximal Policy Optimization (PPO) [39] to allow for solving continuous-action problems.…”
Section: Discussion
confidence: 99%
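The decomposition this excerpt refers to gives each agent its own reward machine, and the agent then learns over the product of the environment state and its RM state. The sketch below shows roughly where the RM state enters the learner, reusing the RewardMachine sketch above; it uses tabular Q-learning for brevity in place of the actor-critic learner the excerpt mentions, and the environment and labelling interfaces are hypothetical.

```python
# Sketch: one agent learning over the product of environment state and its own
# reward-machine state. The env API (reset/step/actions) and the labeller are
# hypothetical; tabular Q-learning stands in for the actor-critic method cited.

from collections import defaultdict
import random


def q_learning_with_rm(env, rm, labeller, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # keyed by ((env_state, rm_state), action)

    for _ in range(episodes):
        s = env.reset()
        rm.state = rm.initial
        done = False
        while not done:
            u = rm.state
            # Epsilon-greedy over the product state (s, u).
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[((s, u), act)])

            s_next, done = env.step(a)
            r = rm.step(labeller(s, a, s_next))  # reward comes from the agent's RM
            u_next = rm.state

            best_next = 0.0 if done else max(Q[((s_next, u_next), act)] for act in env.actions)
            Q[((s, u), a)] += alpha * (r + gamma * best_next - Q[((s, u), a)])
            s = s_next
    return Q
```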
“…In fact, for each finite-horizon objective φ_i, there exists a DFA A_φ with a unique accepting state whose outgoing transitions are all self-loops. These DFAs can be interpreted as reward machines [36], [37] such that the reward of every transition from a non-accepting state to the accepting state is 1, and is 0 otherwise. Note that the minimax optimal value for the undiscounted sum of rewards (total reward) over the product of this reward machine and the discretized subsystem Σ_δi is equal to the optimal probability of satisfaction of the DFA specification.…”
Section: Proof
confidence: 99%
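The construction in this excerpt, reading a DFA with an absorbing accepting state as a reward machine that pays 1 only on the transition into the accepting state, can be sketched as follows. The toy DFA and its encoding are illustrative and not taken from the cited proof.

```python
# Sketch of reading a DFA as a reward machine: reward 1 on any transition from a
# non-accepting state into the (absorbing) accepting state, 0 otherwise.

def dfa_to_reward_machine(dfa_delta, accepting_state):
    """dfa_delta: dict mapping (state, symbol) -> next_state."""
    rm_transitions = {}
    for (state, symbol), next_state in dfa_delta.items():
        reward = 1.0 if (state != accepting_state and next_state == accepting_state) else 0.0
        rm_transitions[(state, symbol)] = (next_state, reward)
    return rm_transitions


# Toy DFA: accept once an 'a' and then a 'b' have been seen; 'acc' is absorbing.
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "acc",
    ("acc", "a"): "acc", ("acc", "b"): "acc",
}
rm = dfa_to_reward_machine(delta, accepting_state="acc")
# The undiscounted total reward along any run is 1 exactly when the run reaches
# 'acc', which is what lets total reward stand in for probability of satisfaction.
```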
“…The assumption of a user-provided reward machine has been lifted in follow-up works (Gaon and Brafman 2020; Xu et al. 2020; Furelos-Blanco et al. 2020). Learning temporal representations of the reward has been explored in different contexts: for multi-agent settings (Neary et al. 2020), for reward shaping (Velasquez et al. 2021b), or with user-provided advice (Neider et al. 2021). All of these approaches are fragile in the presence of noisy rewards.…”
Section: Related Work
confidence: 99%