Abstract

Reward decomposition is a critical problem in the centralized training with decentralized execution (CTDE) paradigm for multi-agent reinforcement learning. To take full advantage of global information, i.e., the states of all agents and the related environment, when decomposing the joint Q value into individual credits, we propose Mixing Network with Meta Policy Gradient (MNMPG), a general meta-learning-based framework that distills a global hierarchy for delicate reward decomposition. The excitation signal for learning this global hierarchy is the difference in episode reward before and after an "exercise update" of the utility network. Our method is applicable to any CTDE method that uses a monotonic mixing network. Experiments on the StarCraft II micromanagement benchmark demonstrate that, even with a simple utility network, our method outperforms current state-of-the-art MARL algorithms on 4 of 5 super-hard scenarios. Performance improves further when combined with a role-based utility network.
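Since the framework targets any CTDE method built on a monotonic mixing network, a concrete reference point may help. Below is a minimal sketch of a QMIX-style monotonic mixer (Rashid et al., 2018) in PyTorch; the class name `MonotonicMixer` and all layer sizes are illustrative choices, not the paper's implementation. Monotonicity of Q_tot in each agent's utility is enforced by passing the state-generated mixing weights through `abs()`.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: Q_tot is monotonic in every agent's utility
    because the state-conditioned mixing weights are forced non-negative."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: the full global state generates weights and biases.
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.b2(state).view(bs, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)  # Q_tot
```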
Introduction

Multi-agent deep reinforcement learning (MARL) algorithms have recently shown extraordinary performance in various games such as DOTA2 Berner et al. (2019), StarCraft Samvelyan et al. (2019), and Honor of Kings Ye et al. (2020). The framework of centralized training with decentralized execution (CTDE) Gupta et al. (2017); Rashid et al. (2018), which enjoys the advantages of both joint action learning Claus & Boutilier (1998) and independent learning Tan (1993), is one of the most popular frameworks for solving collaborative multi-agent tasks. Recent research on CTDE can be divided into two categories. The first enhances each agent's ability to process its local observations individually, for instance by allocating a role Wang et al. (2020b,c) or a mode of exploration Mahajan et al. (2019) to each agent; such methods may require extra prior information and consume more computational resources during decentralized execution due to larger inference networks. The second aims to decompose the single team reward to each agent accurately, either by training a delicate mixing network over local utility values Rashid et al. (2018); Yang et al. (2020); Wang et al. (2020a), or by learning a new joint action-value function to factorize the task Son et al. (2019). The empirical performance of the latter category on challenging tasks is limited by training instability and the need for careful hyperparameter tuning.

Current methods use the global information in the full state only coarsely in the mixing network; for example, Rashid et al. (2018) simply take the full state vector as the input of the mixing network. These methods lack an explicit hierarchy, i.e., information distilled from the full state, for decomposing Q values into individual credits. Inspired by Meta-DDPG Xu et al. (2018a), where meta-learning is adopted for exploration, in this paper we present a general meta-learning-based framework called Mixing Network with Meta Policy Gradient (MNMPG) for exploration on better reward decomposition.
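To make the excitation signal from the abstract concrete, here is a minimal, self-contained sketch of the reward-difference computation in PyTorch. Everything in it is a hypothetical stand-in rather than the paper's implementation: `episode_return` is a toy proxy for a real environment rollout, and the "exercise update" is a single supervised step standing in for a TD update of the utility network.

```python
import torch
import torch.nn.functional as F

# Toy per-agent utility network (illustrative sizes: 8-dim obs, 4 actions).
utility_net = torch.nn.Linear(8, 4)
optimizer = torch.optim.SGD(utility_net.parameters(), lr=1e-2)

def episode_return(net):
    """Placeholder rollout: a real implementation would run one episode
    in the environment with `net` and sum the rewards."""
    obs = torch.randn(8)
    return net(obs).max().item()

# Episode reward *before* the exercise update.
r_before = episode_return(utility_net)

# "Exercise update": one ordinary training step of the utility network
# (a TD loss in practice; a random regression target here for brevity).
obs, target = torch.randn(32, 8), torch.randn(32, 4)
loss = F.mse_loss(utility_net(obs), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Episode reward *after* the exercise update.
r_after = episode_return(utility_net)

# Excitation (meta) signal: the improvement attributable to the update.
meta_reward = r_after - r_before
```

As we read the abstract, this `meta_reward` would then serve as the reward in a policy-gradient update of the module that produces the global hierarchy, favoring hierarchies under which the utility network improves fastest; the wiring of that outer update is omitted here.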