Abstract

Reward decomposition is a critical problem in the centralized training with decentralized execution (CTDE) paradigm for multi-agent reinforcement learning. To take full advantage of global information, i.e., the states of all agents and the related environment, when decomposing the joint Q value into individual credits, we propose Mixing Network with Meta Policy Gradient (MNMPG), a general meta-learning-based framework that distills a global hierarchy for delicate reward decomposition. The excitation signal for learning this global hierarchy is the difference in episode reward before and after an "exercise update" of the utility network. Our method is applicable to any CTDE method that uses a monotonic mixing network. Experiments on the StarCraft II micromanagement benchmark demonstrate that, even with a simple utility network, our method outperforms current state-of-the-art MARL algorithms on 4 of 5 super-hard scenarios. Performance improves further when combined with a role-based utility network.
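Since the framework targets any CTDE method built on a monotonic mixing network, a concrete reference point may help. Below is a minimal sketch of a QMIX-style monotonic mixer (Rashid et al., 2018) in PyTorch; the class name `MonotonicMixer` and all layer sizes are illustrative choices, not the paper's implementation. Monotonicity of Q_tot in each agent's utility is enforced by passing the state-generated mixing weights through `abs()`.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: Q_tot is monotonic in every agent's utility
    because the state-conditioned mixing weights are forced non-negative."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: the full global state generates weights and biases.
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.b2(state).view(bs, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)  # Q_tot
```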
Introduction

Multi-agent deep reinforcement learning (MARL) algorithms have recently shown extraordinary performance in various games such as DOTA2 Berner et al. (2019), StarCraft Samvelyan et al. (2019), and Honor of Kings Ye et al. (2020). The framework of centralized training with decentralized execution (CTDE) Gupta et al. (2017); Rashid et al. (2018), which enjoys the advantages of both joint action learning Claus & Boutilier (1998) and independent learning Tan (1993), is one of the most popular frameworks for solving collaborative multi-agent tasks. Recent research on CTDE can be divided into two categories. The first enhances each agent's ability to process its local observations individually, for instance by allocating a role Wang et al. (2020b,c) or a mode of exploration Mahajan et al. (2019) to each agent; such methods may require extra prior information and consume more computational resources during decentralized execution due to larger inference networks. The second aims to decompose the single team reward to each agent accurately, either by training a delicate mixing network over local utility values Rashid et al. (2018); Yang et al. (2020); Wang et al. (2020a), or by learning a new joint action-value function to factorize the task Son et al. (2019). The empirical performance of the latter category on challenging tasks is limited by training instability and the need for careful hyperparameter tuning.

Current methods use the global information in the full state only coarsely in the mixing network; for example, Rashid et al. (2018) simply take the full state vector as the input of the mixing network. These methods lack an explicit hierarchy, i.e., information distilled from the full state, for decomposing Q values into individual credits. Inspired by Meta-DDPG Xu et al. (2018a), where meta-learning is adopted for exploration, in this paper we present a general meta-learning-based framework called Mixing Network with Meta Policy Gradient (MNMPG) for exploration on better reward decomposition.
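To make the excitation signal from the abstract concrete, here is a minimal, self-contained sketch of the reward-difference computation in PyTorch. Everything in it is a hypothetical stand-in rather than the paper's implementation: `episode_return` is a toy proxy for a real environment rollout, and the "exercise update" is a single supervised step standing in for a TD update of the utility network.

```python
import torch
import torch.nn.functional as F

# Toy per-agent utility network (illustrative sizes: 8-dim obs, 4 actions).
utility_net = torch.nn.Linear(8, 4)
optimizer = torch.optim.SGD(utility_net.parameters(), lr=1e-2)

def episode_return(net):
    """Placeholder rollout: a real implementation would run one episode
    in the environment with `net` and sum the rewards."""
    obs = torch.randn(8)
    return net(obs).max().item()

# Episode reward *before* the exercise update.
r_before = episode_return(utility_net)

# "Exercise update": one ordinary training step of the utility network
# (a TD loss in practice; a random regression target here for brevity).
obs, target = torch.randn(32, 8), torch.randn(32, 4)
loss = F.mse_loss(utility_net(obs), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Episode reward *after* the exercise update.
r_after = episode_return(utility_net)

# Excitation (meta) signal: the improvement attributable to the update.
meta_reward = r_after - r_before
```

As we read the abstract, this `meta_reward` would then serve as the reward in a policy-gradient update of the module that produces the global hierarchy, favoring hierarchies under which the utility network improves fastest; the wiring of that outer update is omitted here.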