In many types of multi-agent systems, distributed agents cooperate with each other to take actions with the goal of maximizing an overall system reward. However, in many of these systems, agents receive only (perhaps noisy) global feedback about the realized overall reward rather than individualized feedback about the relative merit of their own actions with respect to that reward. If the contribution of an agent's actions to the overall reward is unknown a priori, it is crucial for the agents to use a distributed algorithm that still allows them to learn their best actions. In this paper, we rigorously formalize this problem and develop online learning algorithms that enable the agents to cooperatively learn how to maximize the overall reward in such global feedback scenarios without exchanging any information among themselves. We prove that, if the agents observe the global feedback without errors, the distributed nature of the considered multi-agent system results in no performance loss compared with the case where agents can exchange information. When the agents' individual observations are erroneous, existing centralized algorithms, including popular ones such as UCB1, break down. To address this challenge, we propose a novel class of distributed algorithms that are robust to individual observation errors and whose performance can be analytically bounded. We prove that our algorithms' learning regrets (the losses incurred by the algorithms due to uncertainty) grow logarithmically in time, and thus the time-average reward converges to the optimal average reward.
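The setting above can be made concrete with a minimal sketch. Here each agent independently runs a UCB1-style rule over its own actions but can only observe a noisy scalar global reward; the reward model, agent class, and all names are illustrative assumptions, not the paper's actual algorithm:

```python
import math
import random

class UCB1Agent:
    """Hypothetical agent running UCB1 over its own actions, updated
    using only the (noisy) global reward as feedback."""
    def __init__(self, num_actions):
        self.counts = [0] * num_actions   # times each action was played
        self.means = [0.0] * num_actions  # empirical mean global reward per action
        self.t = 0                        # rounds played so far

    def select(self):
        self.t += 1
        for a, c in enumerate(self.counts):
            if c == 0:                    # play each action once first
                return a
        # UCB1 index: empirical mean + exploration bonus
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, action, global_reward):
        # Incremental mean update from the shared global feedback
        self.counts[action] += 1
        self.means[action] += (global_reward - self.means[action]) / self.counts[action]

def run(num_agents=3, num_actions=4, horizon=2000, seed=0):
    """Toy system: the overall reward is the average per-agent action
    quality plus zero-mean noise; agents never exchange information."""
    rng = random.Random(seed)
    quality = [[rng.random() for _ in range(num_actions)]
               for _ in range(num_agents)]
    agents = [UCB1Agent(num_actions) for _ in range(num_agents)]
    total = 0.0
    for _ in range(horizon):
        joint = [ag.select() for ag in agents]
        reward = sum(q[a] for q, a in zip(quality, joint)) / num_agents
        noisy = reward + rng.gauss(0, 0.05)   # each agent sees only this scalar
        for ag, a in zip(agents, joint):
            ag.update(a, noisy)
        total += reward
    return total / horizon                    # time-average realized reward

avg = run()
```

As the abstract notes, this naive per-agent UCB1 approach is exactly what breaks down once individual observations of the global feedback become erroneous; the sketch only illustrates the feedback structure, not the robust algorithms the paper proposes.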
Moreover, we characterize how the regret depends on the size of the action space, and we show that this relationship is determined by the informativeness of the reward structure with regard to each agent's individual action. We prove that when the overall reward is fully informative, regret is linear in the total number of actions of all the agents; when the reward function is not informative, regret is linear in the number of joint actions. Our analytic and numerical results show that the proposed learning algorithms significantly outperform existing online learning solutions in terms of regret and learning speed. We illustrate how our theoretical framework can be used in practice by applying it to online Big Data mining using distributed classifiers. Our framework also applies to many other applications, including online distributed decision making in cooperative multi-agent systems (e.g., packet routing or network coding in multi-hop networks), cross-layer optimization (e.g., parameter selection in different layers), and multi-core processors.
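The gap between the two regret regimes can be illustrated with a back-of-the-envelope count (the numbers below are illustrative, not from the paper): with M agents each having K actions, a fully informative reward lets each agent learn over its own arms, while a non-informative reward forces learning over the exponentially larger joint action space:

```python
def individual_actions(M, K):
    # Fully informative reward: regret scales with the total number
    # of individual actions across agents.
    return M * K

def joint_actions(M, K):
    # Non-informative reward: regret scales with the number of
    # joint actions (every combination of per-agent choices).
    return K ** M

M, K = 5, 10
print(individual_actions(M, K))  # 50
print(joint_actions(M, K))       # 100000
```

For even modest systems, this difference (50 versus 100,000 arms to explore) is why the informativeness of the reward structure dominates the achievable learning speed.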
Index Terms: Multi-agent learning, online learning, multi-armed bandits, Big Data mining, distributed cooperative learning, reward informativeness.