Abstract: We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic v…
“…Agrawal and Devanur (2014) generalized the BwK model by allowing arbitrary concave rewards and convex constraints. Furthermore, similar constrained bandit problems are also studied in settings that include contextual bandits (Agrawal and Devanur, 2014; Wu et al., 2015; Agrawal and Devanur, 2016) and even adversarial bandits (Sun et al., 2017; Immorlica et al., 2019).…”
“…BwK is studied in both adversarial and i.i.d. settings, but here we emphasize only the latter (see Immorlica et al. (2019) for the adversarial case). Assuming concave reward functions, Agrawal and Devanur (2014) propose an upper-confidence-bound-type algorithm that achieves sublinear rates of regret and constraint violations.…”
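To make the upper-confidence-bound principle mentioned above concrete, here is a minimal sketch of the classic UCB1 index rule on unconstrained stochastic bandits. This is not the constrained algorithm of Agrawal and Devanur (2014); the arm means, horizon, and exploration constant below are hypothetical placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

means = [0.3, 0.5, 0.7]   # hypothetical Bernoulli arm means (not from the paper)
K, T = len(means), 10_000

counts = np.zeros(K)      # number of pulls per arm
sums = np.zeros(K)        # cumulative reward per arm
for t in range(1, T + 1):
    if t <= K:
        a = t - 1         # initialization: pull each arm once
    else:
        # Optimism: empirical mean plus a confidence radius that shrinks
        # as an arm is pulled more often.
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    sums[a] += rng.binomial(1, means[a])
    counts[a] += 1

best_share = counts[int(np.argmax(means))] / T  # fraction of pulls on the best arm
```

Because the confidence radius forces under-sampled arms to be tried, suboptimal arms are pulled only O(log T / Δ²) times each, so `best_share` approaches 1 as T grows.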
We consider an online revenue maximization problem over a finite time horizon subject to lower and upper bounds on cost. At each period, an agent receives a context vector sampled i.i.d. from an unknown distribution and needs to make a decision adaptively. The revenue and cost functions depend on the context vector as well as some fixed but possibly unknown parameter vector to be learned. We propose a novel offline benchmark and a new algorithm that mixes an online dual mirror descent scheme with a generic parameter learning process. When the parameter vector is known, we demonstrate an O(√T) regret result as well as an O(√T) bound on the possible constraint violations. When the parameter is not known and must be learned, we demonstrate that the regret and constraint violations are the sums of the previous O(√T) terms plus terms that directly depend on the convergence of the learning process.
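The online dual mirror descent idea in this abstract can be sketched in a toy form: maintain a dual variable for the cost constraint, pick the action maximizing Lagrangian-adjusted revenue, then update the dual by the constraint violation. The action model, per-period budget `rho`, and step size below are hypothetical placeholders, and the sketch uses a Euclidean mirror map (projected dual gradient ascent), not necessarily the paper's exact choice.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                 # horizon
rho = 0.5                # hypothetical per-period budget (total budget / T)
eta = 1.0 / np.sqrt(T)   # dual step size, matching the O(sqrt(T)) scaling

def sample_context():
    # Toy context: 3 candidate actions, columns = (revenue, cost).
    return rng.uniform(0.0, 1.0, size=(3, 2))

lam = 0.0                # dual variable for the cost constraint
total_rev, total_cost = 0.0, 0.0
for t in range(T):
    ctx = sample_context()
    rev, cost = ctx[:, 0], ctx[:, 1]
    # Primal step: maximize Lagrangian revenue minus priced cost.
    a = int(np.argmax(rev - lam * cost))
    total_rev += rev[a]
    total_cost += cost[a]
    # Dual step: ascend on the constraint slack, projected onto lam >= 0
    # (an entropic mirror map would give multiplicative updates instead).
    lam = max(0.0, lam + eta * (cost[a] - rho))

avg_cost = total_cost / T   # the dual price steers this toward rho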
“…The authors in Badanidiyuru et al. (2018) introduce Bandits with Knapsacks, which combines online learning with integer programming for learning under constraints. This model has been extended to various other settings, such as linear contextual bandits (Agrawal and Devanur, 2016), combinatorial semi-bandits (Abinav and Slivkins, 2018), the adversarial setting (Immorlica et al., 2019), and cascading bandits (Zhou et al., 2018). The authors in Combes et al. (2015) establish lower bounds for budgeted bandits and develop algorithms with matching upper bounds.…”
We consider a continuous time multi-armed bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtain a random reward from each sample; however, increasing the frequency of sampling incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining large reward and incurring sampling cost as a function of the sampling frequency. The goal is to design a learning algorithm that minimizes the regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm. CTMAB is fundamentally different from the usual multi-armed bandit problem (MAB); e.g., even the single-arm case is non-trivial in CTMAB, since the optimal sampling frequency depends on the mean of the arm, which needs to be estimated. We first establish lower bounds on the regret achievable with any algorithm, and then propose algorithms that achieve the lower bound up to logarithmic factors. For the single-arm case, we show that the lower bound on the regret is Ω((log T)^2/µ), where µ is the mean of the arm, and T is the time horizon. For the multiple-arms case, we show that the lower bound on the regret is Ω((log T)^2 µ/∆^2), where µ now represents the mean of the best arm, and ∆ is the difference between the means of the best and the second-best arm. We then propose an algorithm that achieves the bound up to constant terms.
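The single-arm tradeoff described above can be illustrated with a deliberately simplified payoff model. The quadratic sampling-cost form below (payoff rate f·µ − c·f², maximized at f* = µ/(2c)) is a hypothetical stand-in, not the paper's model; it only shows why the optimal frequency depends on the unknown mean µ and why a plug-in estimate of µ drives the regret.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_true = 0.6   # unknown arm mean (Bernoulli rewards); toy value
c = 0.5         # hypothetical quadratic sampling-cost coefficient

def opt_freq(mu):
    # Maximizer of the toy payoff rate f * mu - c * f**2.
    return mu / (2 * c)

# Oracle payoff rate: plug the true mean into the optimal frequency.
f_star = opt_freq(mu_true)
oracle_rate = f_star * mu_true - c * f_star**2

# Plug-in learner: estimate mu from samples so far, re-tune the frequency.
n, s = 0, 0.0
learner_payoff = 0.0
rounds = 5000
for t in range(rounds):
    mu_hat = s / n if n > 0 else 1.0          # optimistic initial estimate
    f = opt_freq(mu_hat)
    r = rng.binomial(1, mu_true)              # draw one sample this round
    n += 1
    s += r
    learner_payoff += f * mu_true - c * f**2  # expected payoff rate at freq f

regret = rounds * oracle_rate - learner_payoff
```

Since each round's expected payoff rate is at most `oracle_rate`, the regret is nonnegative, and it stays small because the per-round gap shrinks like the squared estimation error of µ.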
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.