“…Multi-Agent Multi-Armed Bandits. Existing decentralized cooperative MAB algorithms make one or more of the following assumptions: (i) agents independently interact with the same MAB (Lupu, Durand, and Precup 2019); (ii) they use sophisticated communication protocols to exchange information about rewards and the number of times each action was played (Landgren, Srivastava, and Leonard 2016; Martínez-Rubio, Kanade, and Rebeschini 2019; Sankararaman, Ganesh, and Shakkottai 2019; Shahrampour, Rakhlin, and Jadbabaie 2017); or (iii) when sophisticated communication is not possible, agents share their latest action and reward (Madhushani and Leonard 2019). These assumptions often simplify the analysis, but they are unrealistic in most human-AI interactions, e.g., collaborative transport, assembly, cooking, or autonomous driving, where (i) agents' actions influence the outcome for the whole team, (ii) agents lack explicit communication channels or may have different state and action representations that are difficult to communicate, or (iii) agents have different capabilities, e.g., noisier sensors.…”
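To make assumption (iii) concrete, the sketch below shows two UCB1 agents playing the same bandit, where each round every agent broadcasts only its latest action and reward to the other. This is a minimal illustrative toy under assumed arm means, not an implementation of any of the cited algorithms:

```python
import math
import random

class UCBAgent:
    """UCB1 agent that can also absorb a neighbor's shared (action, reward)."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # pulls observed per arm (own + shared)
        self.sums = [0.0] * n_arms   # cumulative reward per arm
        self.t = 0                   # number of own decision rounds

    def select(self):
        self.t += 1
        # Try every arm once before trusting the UCB index.
        for a, c in enumerate(self.counts):
            if c == 0:
                return a
        # Standard UCB1 index: empirical mean + exploration bonus.
        return max(
            range(len(self.counts)),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2 * math.log(self.t) / self.counts[a]),
        )

    def observe(self, action, reward):
        self.counts[action] += 1
        self.sums[action] += reward

# Hypothetical Bernoulli arm means; two agents face the same bandit and,
# each round, share only their latest action and reward (assumption (iii)).
means = [0.2, 0.5, 0.8]
agents = [UCBAgent(len(means)) for _ in range(2)]
random.seed(0)
for _ in range(500):
    plays = [ag.select() for ag in agents]
    results = [(a, 1.0 if random.random() < means[a] else 0.0) for a in plays]
    for ag, (a, r) in zip(agents, results):
        ag.observe(a, r)          # own feedback
    for i, ag in enumerate(agents):
        ag.observe(*results[1 - i])  # neighbor's shared action and reward
```

Because each agent effectively sees twice as many samples as it plays, this simple sharing rule speeds up identification of the best arm; the paragraph's point is that such explicit sharing is exactly what may be unavailable in human-AI teams.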