2020
DOI: 10.48550/arxiv.2007.12322
Preprint
Off-Policy Multi-Agent Decomposed Policy Gradients

Abstract: Recently, multi-agent policy gradient (MAPG) methods have witnessed vigorous progress. However, there is a discrepancy between the performance of MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate the causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient …
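For readers unfamiliar with the decomposition the abstract alludes to, the sketch below illustrates one common form of a linearly decomposed centralized critic, Q_tot(s, a) = Σ_i k_i(s) Q_i(o_i, a_i) + b(s), in PyTorch. It is a minimal sketch under that assumed form, not the paper's implementation; all class, method, and parameter names here are placeholders.

import torch
import torch.nn as nn

class DecomposedCritic(nn.Module):
    """Sketch: a centralized critic written as a state-dependent weighted sum
    of per-agent utilities, Q_tot = sum_i k_i(s) * Q_i(o_i, a_i) + b(s)."""

    def __init__(self, state_dim, obs_dim, n_actions, n_agents, hidden=64):
        super().__init__()
        # One utility head per agent, evaluated on that agent's local observation.
        self.agent_qs = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_agents)])
        # State-conditioned, non-negative mixing weights k_i(s) and bias b(s).
        self.k = nn.Sequential(nn.Linear(state_dim, n_agents), nn.Softplus())
        self.b = nn.Linear(state_dim, 1)

    def forward(self, state, obs, actions):
        # state: (batch, state_dim); obs: (batch, n_agents, obs_dim);
        # actions: (batch, n_agents) integer (long) indices of chosen actions.
        q_i = torch.stack(
            [head(obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)
             for i, head in enumerate(self.agent_qs)], dim=1)   # (batch, n_agents)
        return (self.k(state) * q_i).sum(dim=1, keepdim=True) + self.b(state)

Because the joint value is linear in the per-agent utilities, the gradient of Q_tot with respect to each agent's action-value factorizes, which is what makes per-agent policy updates cheap in this family of methods.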

Cited by 19 publications (28 citation statements). References 27 publications (31 reference statements).
“…according to Theorem 1 and Theorem 2. However, it is not easy to optimize k s directly because the bias in k+1,i s accumulates as the number of agents increases, which makes learning unstable [47]. Therefore, we adopt the n-step evaluation to eliminate the accumulated bias.…”
Section: Multi-agent Policy Evaluation with λ-Return (mentioning, confidence: 99%)
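The quoted passage mentions n-step evaluation. As a point of reference, here is a minimal NumPy sketch of n-step TD targets, y_t = Σ_{k<n} γ^k r_{t+k} + γ^n V(s_{t+n}); it illustrates the general technique only, not code from the citing paper, and the function name and array layout are assumptions.

import numpy as np

def n_step_targets(rewards, values, gamma=0.99, n=5):
    # rewards: length-T array; values: length-(T+1) array of state-value
    # estimates, with values[T] = 0 if the episode terminated.
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)                       # truncate at episode end
        ret = sum(gamma ** k * rewards[t + k] for k in range(horizon))
        targets[t] = ret + gamma ** horizon * values[t + horizon]
    return targets

Truncating the bootstrap horizon at n steps is what limits how far any bias in the value estimates can propagate into the targets.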
“…We first construct the multi-agent offline datasets based on ten maps in StarCraft II (see Table 3 in Appendix E). The datasets are made by collecting DOP [47] training data. All maps share the same reward function, and each map includes 12000 trajectories.…”
Section: Multi-agent Offline Tasks on StarCraft II (mentioning, confidence: 99%)
“…An alternative paradigm called centralized training and decentralized execution (CTDE; Kraemer & Banerjee, 2016) is widely used in both policy-based and value-based methods. Policy-based multi-agent reinforcement learning methods use a centralized critic to compute gradients for the local actors (Lowe et al., 2017; Foerster et al., 2018; Wen et al., 2019; Wang et al., 2020d). Value-based methods usually decompose the joint value function into individual value functions under the IGM (individual-global-max) principle, which guarantees the consistency between local action selection and joint action optimization (Sunehag et al., 2018; Rashid et al., 2020b; Son et al., 2019; Wang et al., 2021a; Rashid et al., 2020a).…”
Section: Related Work (mentioning, confidence: 99%)
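To make the IGM (individual-global-max) property mentioned above concrete, the toy sketch below uses the simplest decomposition that satisfies it, a VDN-style additive one: when Q_tot is a sum of per-agent utilities, each agent's local greedy action also maximizes the joint value. This is illustrative only; the function name and toy numbers are placeholders.

import numpy as np

def vdn_greedy(per_agent_qs):
    # Additive (VDN-style) decomposition: Q_tot(a_1..a_n) = sum_i Q_i(a_i).
    # Under this form the IGM property holds: decentralized per-agent argmax
    # actions jointly maximize Q_tot.
    greedy = [int(np.argmax(q)) for q in per_agent_qs]
    q_tot = sum(float(q[a]) for q, a in zip(per_agent_qs, greedy))
    return greedy, q_tot

# Toy check with two agents and three actions each:
actions, value = vdn_greedy([np.array([0.1, 0.7, 0.2]),
                             np.array([0.4, 0.3, 0.9])])
# actions == [1, 2]; value == 1.6, which also maximizes over all 9 joint actions.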
“…Proper credit assignment is essential for coordination among multiple agents in both policy-based [6,30] and value-based [23] cooperative MARL. The credit each agent receives must reflect their contribution towards the coordinated performance.…”
Section: Introduction (mentioning, confidence: 99%)
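One standard way to make an agent's credit reflect its contribution is a COMA-style counterfactual baseline, which marginalizes out that agent's own action under its policy while holding the other agents' actions fixed. The sketch below is a generic illustration of that idea, not the method of any specific paper cited here; q_joint and the argument layout are assumptions.

def counterfactual_advantage(q_joint, joint_action, agent, policy_i):
    # COMA-style counterfactual advantage for `agent`: the joint value of the
    # taken action minus a baseline that averages over that agent's possible
    # actions under its policy `policy_i`, keeping the other agents' actions fixed.
    # `q_joint` maps a tuple of actions (one per agent) to a joint Q-value.
    taken = q_joint(tuple(joint_action))
    baseline = 0.0
    for a_i, prob in enumerate(policy_i):
        alt = list(joint_action)
        alt[agent] = a_i
        baseline += prob * q_joint(tuple(alt))
    return taken - baseline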