2021
DOI: 10.48550/arxiv.2111.11188
Preprint

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

Abstract: The idea of conservatism has led to significant progress in offline reinforcement learning (RL) where an agent learns from pre-collected datasets. However, it is still an open question to resolve offline RL in the more practical multi-agent setting as many real-world scenarios involve interaction among multiple agents. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer to multi-agent settings directly. Surprisingly… [abstract truncated]

Cited by 3 publications (3 citation statements) · References 30 publications
“…Most offline RL methods consider the out-of-distribution action [11] as the fundamental challenge, which is the main cause of the extrapolation error [5] in value estimate in the single-agent environment. To minimize the extrapolation error, some recent methods introduce constraints to enforce the learned policy to be close to the behavior policy, which can be direct action constraint [5], kernel MMD [9], Wasserstein distance [30], KL divergence [20], or l2 distance [4,19]. Some methods train a Q-function pessimistic to out-of-distribution actions to avoid overestimation by adding a reward penalty quantified by the learned environment model [34], by minimizing the Q-values of out-of-distribution actions [10,33], by weighting the update of Q-function via Monte Carlo dropout [31], or by explicitly assigning and training pseudo Q-values for out-of-distribution actions [15].…”
Section: Offline RL (mentioning)
Confidence: 99%
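The l2-distance policy constraint referenced in this excerpt (e.g., Fujimoto and Gu 2021; Pan et al. 2021) is typically realized as a behavior-cloning penalty added to the actor objective. Below is a minimal, hedged sketch in the TD3+BC style; the function name, the `bc_weight` coefficient, and the batch layout are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn.functional as F

def actor_loss_with_l2_constraint(actor, critic, states, dataset_actions, bc_weight=2.5):
    """Behavior-cloning-regularized actor loss (TD3+BC-style sketch).

    The actor is pushed toward high Q-values while an l2 penalty keeps its
    actions close to the dataset (behavior-policy) actions, which limits
    out-of-distribution queries to the critic.
    """
    policy_actions = actor(states)             # a = pi(s)
    q_values = critic(states, policy_actions)  # Q(s, pi(s))

    # Scale the RL term so the BC penalty stays comparable in magnitude
    # (an illustrative choice, not prescribed by the cited papers).
    lam = bc_weight / q_values.abs().mean().detach()

    rl_term = -lam * q_values.mean()                        # maximize Q
    bc_term = F.mse_loss(policy_actions, dataset_actions)   # l2 to behavior actions
    return rl_term + bc_term
```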
“…Offline RL easily suffers from the extrapolation error, which is mainly caused by out-of-distribution actions in single-agent environments. Constraint-based methods introduce policy constraints to enforce the learned policy to be close to the behavior policy, e.g., direct action constraint (Fujimoto, Meger, and Precup 2019), kernel MMD (Kumar et al 2019), Wasserstein distance (Wu, Tucker, and Nachum 2019), and l2 distance (Pan et al 2021;Fujimoto and Gu 2021). Conservative methods (Kumar et al 2020;Yu et al 2021) train a Q-function pessimistic to out-of-distribution actions.…”
Section: Related Work (mentioning)
Confidence: 99%
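The conservative methods this excerpt points to (Kumar et al. 2020; Yu et al. 2021) penalize the Q-values of out-of-distribution actions. A rough sketch of a CQL-style regularizer for a discrete action space is shown below; the function and variable names are assumptions for illustration, and the full algorithm involves additional terms omitted here.

```python
import torch

def conservative_q_penalty(q_network, states, dataset_actions, alpha=1.0):
    """CQL-style conservative regularizer (discrete-action sketch).

    Pushes down a log-sum-exp over the Q-values of all actions, which acts as a
    soft maximum covering out-of-distribution actions, while pushing up the
    Q-values of actions actually present in the offline dataset.
    """
    q_all = q_network(states)                    # shape: [batch, n_actions]
    soft_max_q = torch.logsumexp(q_all, dim=1)   # soft maximum over all actions
    data_q = q_all.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)

    # This penalty is added to the standard TD loss when training the critic.
    return alpha * (soft_max_q - data_q).mean()
```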
“…One emerging subarea is offline MARL, where plenty of empirical works have been done while the theoretical understanding is still largely missing [Pan et al., 2021, Jiang and Lu, 2021, Meng et al., 2021]. Offline RL has received tremendous attention because in various practical scenarios, it is expensive to acquire online data while offline log data is accessible.…”
Section: Introduction (mentioning)
Confidence: 99%