Learning from datasets without interaction with environments (offline learning) is an essential step toward applying reinforcement learning (RL) algorithms in real-world scenarios. However, compared with its single-agent counterpart, offline multi-agent RL introduces more agents with larger state and action spaces, which is more challenging but has attracted little attention. We demonstrate that current offline RL algorithms are ineffective in multi-agent systems due to accumulated extrapolation error. In this paper, we propose a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates extrapolation error by trusting only the state-action pairs given in the dataset for value estimation. Moreover, we extend ICQ to multi-agent tasks by decomposing the joint policy under the implicit constraint. Experimental results demonstrate that the extrapolation error is reduced to almost zero and is insensitive to the number of agents. We further show that ICQ achieves state-of-the-art performance on challenging multi-agent offline tasks (StarCraft II).
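To make the core idea concrete, the sketch below shows an in-sample temporal-difference update in which the Bellman target only ever queries the Q-function at next-state actions that actually appear in the dataset, so no unseen action is evaluated. This is a minimal illustration of the "trust only dataset state-action pairs" principle; the network sizes, names, and simplified SARSA-style target are our own assumptions, not the paper's exact ICQ update.

```python
import torch
import torch.nn as nn

# Minimal sketch of in-sample value estimation. The simplified SARSA-style
# target below is illustrative, not the exact ICQ objective. The point: the
# Bellman target only queries Q at (s', a') pairs stored in the dataset,
# never at unseen actions such as argmax_a Q(s', a).

q_net = nn.Linear(8, 4)        # toy Q-network: 8-dim state, 4 discrete actions
q_target = nn.Linear(8, 4)     # target network (periodically synced)
q_target.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def in_sample_td_loss(batch):
    s, a, r, s_next, a_next, done = batch
    with torch.no_grad():
        # Evaluate ONLY the next action stored in the dataset -- avoiding
        # max_a Q(s', a), which would extrapolate to out-of-dataset actions.
        q_next = q_target(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_pred, target)

# Toy batch of 32 transitions drawn from a fixed offline dataset.
batch = (torch.randn(32, 8), torch.randint(0, 4, (32,)),
         torch.randn(32), torch.randn(32, 8),
         torch.randint(0, 4, (32,)), torch.zeros(32))
loss = in_sample_td_loss(batch)
loss.backward()
optimizer.step()
```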
Introduction

Recently, reinforcement learning (RL), an active learning process, has achieved massive success in various domains ranging from strategy games [51] to recommendation systems [6]. However, applying RL to real-world scenarios poses practical challenges: interaction with the real world, such as in autonomous driving, is usually expensive or risky. To address these issues, offline RL is an attractive approach to practical problems [2,22,30,36,13,24,3,21,46,10], aiming to learn from a fixed dataset without interaction with the environment.

The greatest obstacle in offline RL is the distribution shift issue [14], which leads to extrapolation error, a phenomenon in which unseen state-action pairs are erroneously estimated. Unlike in the online setting, the inaccurately estimated values of unseen pairs cannot be corrected by interacting with the environment. Therefore, most off-policy RL algorithms fail on offline tasks due to intractable overestimation. Modern offline methods (e.g., Batch-Constrained deep Q-learning (BCQ) [14]) constrain the learned policy to stay close to the behavior policy or suppress the Q-value directly. These methods have achieved massive success on challenging single-agent offline tasks such as D4RL [12].

However, many decision processes in real-world scenarios involve multi-agent systems, such as intelligent transportation systems [1], sensor networks [31], and power grids [5]. We demonstrate that the number of unseen state-action pairs grows exponentially with the number of agents in multi-agent systems, quickly accumulating extrapolation error; the back-of-the-envelope sketch below illustrates why. Moreover, the current offline algorithms
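The following sketch uses toy numbers of our own choosing (not figures from the paper) to show why coverage collapses: with |A| actions per agent, the joint action space is |A|^n, so a fixed-size dataset covers a vanishing fraction of joint actions as agents are added.

```python
# Back-of-the-envelope illustration with assumed toy numbers: a fixed dataset
# can contain at most dataset_size distinct joint actions, while the joint
# action space grows as num_actions ** n_agents.
num_actions = 10        # actions available to each agent (assumed)
dataset_size = 10_000   # distinct joint actions the dataset could hold (assumed)

for n_agents in (1, 2, 4, 8, 16):
    joint_actions = num_actions ** n_agents
    coverage = min(1.0, dataset_size / joint_actions)
    print(f"{n_agents:>2} agents: |A|^n = {joint_actions:.3e}, "
          f"max dataset coverage = {coverage:.2e}")
```

Even under these generous assumptions, coverage drops from 100% with a few agents to one part in a trillion at 16 agents, so almost every joint action a learned policy might propose is unseen, and its value must be extrapolated.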