Learning from datasets without interaction with environments (offline learning) is an essential step toward applying reinforcement learning (RL) algorithms in real-world scenarios. However, compared with its single-agent counterpart, offline multi-agent RL introduces more agents with larger state and action spaces, which is more challenging but has attracted little attention. We demonstrate that current offline RL algorithms are ineffective in multi-agent systems due to accumulated extrapolation error. In this paper, we propose a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates extrapolation error by trusting only the state-action pairs given in the dataset for value estimation. Moreover, we extend ICQ to multi-agent tasks by decomposing the joint policy under the implicit constraint. Experimental results demonstrate that the extrapolation error is reduced to almost zero and is insensitive to the number of agents. We further show that ICQ achieves state-of-the-art performance on challenging multi-agent offline tasks (StarCraft II).
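To make the core idea concrete, the sketch below shows an in-sample temporal-difference update in which the Bellman target only ever queries the Q-function at next-state actions that actually appear in the dataset, so no unseen action is evaluated. This is a minimal illustration of the "trust only dataset state-action pairs" principle; the network sizes, names, and simplified SARSA-style target are our own assumptions, not the paper's exact ICQ update.

```python
import torch
import torch.nn as nn

# Minimal sketch of in-sample value estimation. The simplified SARSA-style
# target below is illustrative, not the exact ICQ objective. The point: the
# Bellman target only queries Q at (s', a') pairs stored in the dataset,
# never at unseen actions such as argmax_a Q(s', a).

q_net = nn.Linear(8, 4)        # toy Q-network: 8-dim state, 4 discrete actions
q_target = nn.Linear(8, 4)     # target network (periodically synced)
q_target.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def in_sample_td_loss(batch):
    s, a, r, s_next, a_next, done = batch
    with torch.no_grad():
        # Evaluate ONLY the next action stored in the dataset -- avoiding
        # max_a Q(s', a), which would extrapolate to out-of-dataset actions.
        q_next = q_target(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_pred, target)

# Toy batch of 32 transitions drawn from a fixed offline dataset.
batch = (torch.randn(32, 8), torch.randint(0, 4, (32,)),
         torch.randn(32), torch.randn(32, 8),
         torch.randint(0, 4, (32,)), torch.zeros(32))
loss = in_sample_td_loss(batch)
loss.backward()
optimizer.step()
```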
Introduction

Recently, reinforcement learning (RL), an active learning process, has achieved massive success in various domains ranging from strategy games [51] to recommendation systems [6]. However, applying RL to real-world scenarios poses practical challenges: interaction with the real world, such as in autonomous driving, is usually expensive or risky. To address these issues, offline RL is an attractive approach to practical problems [2,22,30,36,13,24,3,21,46,10], aiming to learn from a fixed dataset without interaction with the environment.

The greatest obstacle in offline RL is the distribution shift issue [14], which leads to extrapolation error, a phenomenon in which unseen state-action pairs are erroneously estimated. Unlike in the online setting, the inaccurately estimated values of unseen pairs cannot be corrected by interacting with the environment. Therefore, most off-policy RL algorithms fail on offline tasks due to intractable overestimation. Modern offline methods (e.g., Batch-Constrained deep Q-learning (BCQ) [14]) constrain the learned policy to stay close to the behavior policy or suppress the Q-value directly. These methods have achieved massive success on challenging single-agent offline tasks such as D4RL [12].

However, many decision processes in real-world scenarios involve multi-agent systems, such as intelligent transportation systems [1], sensor networks [31], and power grids [5]. We demonstrate that the number of unseen state-action pairs grows exponentially with the number of agents in multi-agent systems, quickly accumulating extrapolation error; the back-of-the-envelope sketch below illustrates why. Moreover, the current offline algorithms
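The following sketch uses toy numbers of our own choosing (not figures from the paper) to show why coverage collapses: with |A| actions per agent, the joint action space is |A|^n, so a fixed-size dataset covers a vanishing fraction of joint actions as agents are added.

```python
# Back-of-the-envelope illustration with assumed toy numbers: a fixed dataset
# can contain at most dataset_size distinct joint actions, while the joint
# action space grows as num_actions ** n_agents.
num_actions = 10        # actions available to each agent (assumed)
dataset_size = 10_000   # distinct joint actions the dataset could hold (assumed)

for n_agents in (1, 2, 4, 8, 16):
    joint_actions = num_actions ** n_agents
    coverage = min(1.0, dataset_size / joint_actions)
    print(f"{n_agents:>2} agents: |A|^n = {joint_actions:.3e}, "
          f"max dataset coverage = {coverage:.2e}")
```

Even under these generous assumptions, coverage drops from 100% with a few agents to one part in a trillion at 16 agents, so almost every joint action a learned policy might propose is unseen, and its value must be extrapolated.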