2019 Australian & New Zealand Control Conference (ANZCC)
DOI: 10.1109/anzcc47194.2019.8945748

Towards Q-learning the Whittle Index for Restless Bandits

Cited by 21 publications (39 citation statements)
References 13 publications
“…The second category contains different learning methods for RMABs. Fu et al. 2019 provide a Q-learning method where the Q value is defined based on the Whittle indices, states, and actions. However, they do not provide a proof of convergence to the optimal solution, and experimentally they do not learn (near-)optimal policies.…”
Section: Related Work
Mentioning confidence: 99%
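The statement above describes Q values indexed by a subsidy (the Whittle index candidate) in addition to the state and action. A minimal sketch of that idea follows, assuming a discretised subsidy grid; all names (lambda_grid, n_states, alpha, gamma) are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Sketch: Q-learning with a subsidy-indexed Q-table, in the spirit of the
# approach described above. 0 = passive action, 1 = active action.
n_states, n_actions = 5, 2
lambda_grid = np.linspace(0.0, 1.0, 11)   # discretised subsidy values
Q = np.zeros((len(lambda_grid), n_states, n_actions))
alpha, gamma = 0.1, 0.95                  # learning rate, discount factor

def update(li, s, a, r, s_next):
    """One tabular Q-learning step under subsidy lambda_grid[li].
    The passive action earns the subsidy on top of the observed reward."""
    subsidy = lambda_grid[li] if a == 0 else 0.0
    target = r + subsidy + gamma * Q[li, s_next].max()
    Q[li, s, a] += alpha * (target - Q[li, s, a])

def whittle_index_estimate(s):
    """Smallest subsidy at which passivity is estimated to be as good as
    activity in state s -- a crude proxy for the Whittle index."""
    for li, lam in enumerate(lambda_grid):
        if Q[li, s, 0] >= Q[li, s, 1]:
            return lam
    return lambda_grid[-1]
```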
“…(2) AB [Avrachenkov and Borkar, 2020], (3) Fu [Fu et al., 2019], (4) Greedy: greedily chooses the top M arms with the highest difference in their observed average rewards between actions 1 and 0 at their current states, and (5) Random: chooses M arms uniformly at random at each step. We consider a numerical example and a maternal healthcare application to simulate RMAB instances using beneficiaries' behavioral patterns from the call-based program.…”
Section: Experimental Evaluation
Mentioning confidence: 99%
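A minimal sketch of the Greedy baseline described above: rank arms by the gap between their observed average rewards under the active and passive actions at their current states and pick the top M. The data structures (reward_sums, counts) are assumptions made for illustration.

```python
import numpy as np

def greedy_select(reward_sums, counts, current_states, M):
    """reward_sums, counts: arrays of shape (n_arms, n_states, 2) holding the
    running sum and count of rewards per (arm, state, action).
    Returns the indices of the M arms with the largest active-passive gap."""
    avg = reward_sums / np.maximum(counts, 1)   # empirical mean rewards
    n_arms = avg.shape[0]
    gaps = np.array([avg[i, current_states[i], 1] - avg[i, current_states[i], 0]
                     for i in range(n_arms)])
    return np.argsort(gaps)[-M:]
```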
“…Despite the recent successes of reinforcement learning (RL) for solving large-scale games [28,37], RL has so far seen little application to RMABs, except for a few recent works that learn Whittle indices for indexable binary-action RMABs using (i) deep RL [29] and (ii) Q-learning when states are observable [5,7] or when arms are homogeneous [4]. In contrast, our deep RL approach provides a more general solution to binary and multi-action RMAB domains that performs well regardless of indexability.…”
Section: Related Work
Mentioning confidence: 99%
“…Addressing this, Biswas et al. [5] give a Q-learning-based algorithm that acts on the arms that have the largest difference between their active and passive Q values. Fu et al. [8] take a related approach that adjusts the Q values by some 𝜆 and uses it to estimate the Whittle index. Similarly, Avrachenkov and Borkar [3] provide a two-timescale algorithm that learns the Q values as well as the index values over time.…”
Section: Related Work
Mentioning confidence: 99%
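A heavily simplified sketch of the two-timescale idea attributed to Avrachenkov and Borkar [3] above: Q values move with a fast step size, while a per-state index estimate moves with a slower one, nudged toward the subsidy that equalises the active and passive Q values. Step-size schedules and variable names here are assumptions; the published algorithm differs in detail.

```python
import numpy as np

n_states = 5
Q = np.zeros((n_states, 2))   # fast-timescale Q values for one arm
lam = np.zeros(n_states)      # slow-timescale Whittle index estimates
gamma = 0.95

def two_timescale_step(s, a, r, s_next, t):
    alpha = 1.0 / (1 + t) ** 0.6          # fast step size
    beta = 1.0 / (1 + t)                  # slower step size (beta/alpha -> 0)
    subsidy = lam[s] if a == 0 else 0.0   # passive action earns current index
    target = r + subsidy + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    # Nudge the index toward the value equalising active and passive Q values.
    lam[s] += beta * (Q[s, 1] - Q[s, 0])
```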
“…To address this shortcoming in previous work, this paper presents the first algorithms for the online setting for multi-action RMABs. Indeed, the online setting for even binary-action RMABs has received only limited attention, in the works of Fu et al. [8], Avrachenkov and Borkar [3], and Biswas et al. [5,6]. These papers adopt variants of the Q-learning update rule [29,30], a well-studied reinforcement learning algorithm, for estimating the effect of each action across the changing dynamics of the systems.…”
Section: Introduction
Mentioning confidence: 99%
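For reference, the textbook tabular Q-learning update rule that the papers above adapt, shown as a short sketch (the alpha and gamma values are illustrative defaults):

```python
import numpy as np

# Standard tabular Q-learning update (Watkins-style): move Q(s, a) toward the
# bootstrapped target r + gamma * max_a' Q(s', a').
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q
```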