2021
DOI: 10.1609/aaai.v35i13.17353

Value-Decomposition Multi-Agent Actor-Critics

Abstract: The exploitation of extra state information has been an active research area in multi-agent reinforcement learning (MARL). QMIX represents the joint action-value using a non-negative function approximator and achieves the best performance on the StarCraft II micromanagement testbed, a common MARL benchmark. However, our experiments demonstrate that, in some cases, QMIX performs sub-optimally with the A2C framework, a training paradigm that promotes algorithm training efficiency. To obtain a reasonable trade-of…
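As context for the abstract, the sketch below gives one minimal, illustrative reading of the idea it describes: per-agent actor-critic heads whose local state values are combined by a non-negative (monotonic) mixer and trained with an A2C-style temporal-difference advantage. This is not the paper's implementation; every class, function, and parameter name here is an assumption.

```python
# Minimal sketch (not the paper's implementation): per-agent actor-critics whose
# local state values are combined by a monotonic (non-negative-weight) mixer and
# trained with an A2C-style TD-advantage loss. All names are illustrative.
import torch
import torch.nn as nn


class LocalActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)  # decentralized policy head
        self.v = nn.Linear(hidden, 1)           # local state-value head

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h)


class MonotonicMixer(nn.Module):
    """Combines local values into a joint value using non-negative weights,
    so the joint value is monotonic in every local value (QMIX-style)."""
    def __init__(self, n_agents, state_dim):
        super().__init__()
        self.w = nn.Linear(state_dim, n_agents)  # state-conditioned mixing weights
        self.b = nn.Linear(state_dim, 1)         # state-dependent bias

    def forward(self, local_vs, state):
        # local_vs: [batch, n_agents], state: [batch, state_dim]
        weights = torch.abs(self.w(state))       # enforce non-negativity
        return (weights * local_vs).sum(-1, keepdim=True) + self.b(state)


def a2c_loss(dist, actions, v_tot, returns, value_coef=0.5, entropy_coef=0.01):
    # dist: per-agent action distribution; actions: [batch];
    # v_tot / returns: [batch, 1] joint value estimates and bootstrapped targets.
    advantage = (returns - v_tot).squeeze(-1).detach()
    policy_loss = -(dist.log_prob(actions) * advantage).mean()
    value_loss = (returns - v_tot).pow(2).mean()
    return policy_loss + value_coef * value_loss - entropy_coef * dist.entropy().mean()
```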

Cited by 48 publications (32 citation statements). References 15 publications.
“…However, unlike QMIX, VDAC is compatible with A2C, which makes sampling more efficient. In addition, the study demonstrates that by following a simple gradient calculated from a temporal-difference advantage, the policy converges to a local optimum (Su et al. 2021). Q-DPP (Yang et al. 2020b) does not rely on constraints to decompose the global value function.…”
Section: Centralised Training and Decentralised Execution
confidence: 87%
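The quoted claim concerns updating the decentralized policy along a gradient weighted by a temporal-difference advantage. A hypothetical, minimal illustration of that update term (symbol names are assumptions, not taken from the paper):

```python
# Hypothetical illustration of the quoted claim: the decentralized policy is
# updated along a gradient weighted by a temporal-difference advantage.
# Symbol names (reward, v, v_next, gamma) are assumptions, not from the paper.
import torch


def td_advantage(reward, v, v_next, gamma=0.99, done=False):
    """One-step TD advantage on tensors: A = r + gamma * V(s') * (1 - done) - V(s)."""
    return reward + gamma * v_next * (1.0 - float(done)) - v


def policy_gradient_term(log_prob, advantage):
    """Minimizing this term ascends the log-probability weighted by the advantage."""
    return -(log_prob * advantage.detach()).mean()
```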
“…Also, there exists a series of value-function-factorization-based methods that train decentralized policies in a centralized end-to-end fashion, employing a joint action-value network [27]. There are many follow-ups on actor-critic-based MARL algorithms addressing a variety of issues, namely SAC [26] for improving scalability, SEAC [4] for sharing experience, LICA [44] for the credit assignment problem, Tesseract [21] for tensorizing the critics, Bilevel Actor Critic [42] for the multi-agent coordination problem with unequal agents, DAC-TD [6] for training agents in a privacy-aware framework, VDAC [31] for combining the value-decomposition framework with actor-critic, Scalable Actor Critic [17] for scalable learning in a stochastic network of agents, etc. However, none of these works consider the role-emergence paradigm in their model.…”
Section: Related Literature
confidence: 99%
“…In MARL, each pond could be controlled by an individual agent tuned to that pond's specific goals, while also operating cooperatively towards system-level goals. 54,55 In MORL, sets of policies are learned to approximate a Pareto frontier; 56 this is especially valuable for comparing trade-offs among agents. Similar multi-objective optimization is well studied for reservoir operation and could provide an alternative to MORL.…”
Section: Towards System-level Control
confidence: 99%