2012 IEEE 51st Conference on Decision and Control (CDC)
DOI: 10.1109/cdc.2012.6426587
Decentralized learning for multi-player multi-armed bandits

Abstract: We consider the problem of distributed online learning with multiple players in multi-armed bandit (MAB) models. Each player can pick among multiple arms; when a player picks an arm, it gets a reward. Any other communication between the users is costly and will add to the regret. We propose an online index-based distributed learning policy, the dUCB4 algorithm, that trades off exploration vs. exploitation in the right way and achieves expected regret that grows at most as near-O(log² T). The motivation co…
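The excerpt does not include the dUCB4 policy itself, but the index-based exploration/exploitation trade-off it builds on can be illustrated with a minimal single-player UCB-style sketch. Everything below (the `ucb_index` bonus form, the Bernoulli arms, the constants) is an illustrative assumption, not the paper's algorithm: each round the player computes an optimistic index per arm (empirical mean plus a confidence bonus that shrinks with pulls) and plays the arm with the largest index.

```python
import math
import random

def ucb_index(mean, count, t, scale=2.0):
    """Optimistic index: empirical mean plus an exploration bonus.

    The bonus grows with log(t) and shrinks as the arm is pulled more,
    so under-sampled arms are periodically re-explored.
    """
    return mean + math.sqrt(scale * math.log(t) / count)

def run_ucb(arm_means, horizon, seed=0):
    """Play a single-player UCB policy against Bernoulli arms.

    Returns per-arm pull counts and the total reward collected.
    """
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n        # pulls per arm
    means = [0.0] * n       # running empirical means
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1     # initialize: play each arm once
        else:
            arm = max(range(n),
                      key=lambda a: ucb_index(means[a], counts[a], t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total += reward
    return counts, total

counts, total = run_ucb([0.2, 0.5, 0.8], horizon=5000)
```

Over 5000 rounds the best arm (mean 0.8) accumulates the vast majority of pulls, which is the sense in which an index policy "trades off exploration vs. exploitation"; the multi-player, limited-communication setting of the paper layers a decentralized coordination scheme on top of indices of this kind.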


Cited by 21 publications (32 citation statements)
References 23 publications
“…As a consequence, orthogonalizing the players over the best arms may not be the optimal allocation. In [22], Kalathil et al. considered the case where arm ranks may be different across players. They proposed a decentralized policy that achieves regret under the i.i.d.…”
Section: Related Work on RMAB
confidence: 99%
“…• In our problem setting, both noise-limited and interference-limited transmission models are studied, and we do not impose any limitation on the interference pattern. This is in contrast with [16], [17] and [31], where the interference is either completely neglected or limited to the neighboring users. This is important since, in general, channel allocation based on interference avoidance is suboptimal.…”
Section: B. Our Contribution
confidence: 72%
“…This stands in contrast to [15], where the reward of each specific channel is assumed to be equal for all users, and only the availability is stochastic. • Unlike [18] and [31], our algorithm does not require information exchange.…”
Section: B. Our Contribution
confidence: 96%
“…In [16], Kalathil et al. studied the problem of distributed online learning with multiple players in multi-armed bandits and proposed an online index-based distributed learning policy. In [17], N-armed bandits have been applied in pay-per-click auctions for Internet advertising, while in [18] for truthful sponsored search auctions and in [19] for keyword selection in search-based advertising.…”
Section: Related Work
confidence: 99%