Abstract: We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic v…
“…Agrawal and Devanur (2014) generalized the BwK model by allowing arbitrary concave rewards and convex constraints. Furthermore, similar constrained bandit problems are also studied in settings that include contextual bandits (Agrawal and Devanur, 2014; Wu et al., 2015; Agrawal and Devanur, 2016) and even adversarial bandits (Sun et al., 2017; Immorlica et al., 2019).…”
“…BwK is studied in both adversarial and i.i.d. settings, but here we emphasize only the latter (see Immorlica et al. (2019) for the adversarial case). Assuming concave reward functions, Agrawal and Devanur (2014) propose an upper-confidence-bound-type algorithm that achieves sublinear rates of regret and constraint violations.…”
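To make the upper-confidence-bound principle mentioned above concrete, here is a minimal sketch of the classic UCB1 index rule on unconstrained stochastic bandits. This is not the constrained algorithm of Agrawal and Devanur (2014); the arm means, horizon, and exploration constant below are hypothetical placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

means = [0.3, 0.5, 0.7]   # hypothetical Bernoulli arm means (not from the paper)
K, T = len(means), 10_000

counts = np.zeros(K)      # number of pulls per arm
sums = np.zeros(K)        # cumulative reward per arm
for t in range(1, T + 1):
    if t <= K:
        a = t - 1         # initialization: pull each arm once
    else:
        # Optimism: empirical mean plus a confidence radius that shrinks
        # as an arm is pulled more often.
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    sums[a] += rng.binomial(1, means[a])
    counts[a] += 1

best_share = counts[int(np.argmax(means))] / T  # fraction of pulls on the best arm
```

Because the confidence radius forces under-sampled arms to be tried, suboptimal arms are pulled only O(log T / Δ²) times each, so `best_share` approaches 1 as T grows.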
We consider an online revenue maximization problem over a finite time horizon subject to lower and upper bounds on cost. At each period, an agent receives a context vector sampled i.i.d. from an unknown distribution and needs to make a decision adaptively. The revenue and cost functions depend on the context vector as well as some fixed but possibly unknown parameter vector to be learned. We propose a novel offline benchmark and a new algorithm that mixes an online dual mirror descent scheme with a generic parameter learning process. When the parameter vector is known, we demonstrate an O(√T) regret result as well as an O(√T) bound on the possible constraint violations. When the parameter is not known and must be learned, we demonstrate that the regret and constraint violations are the sums of the previous O(√T) terms plus terms that directly depend on the convergence of the learning process.
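The online dual mirror descent idea in this abstract can be sketched in a toy form: maintain a dual variable for the cost constraint, pick the action maximizing Lagrangian-adjusted revenue, then update the dual by the constraint violation. The action model, per-period budget `rho`, and step size below are hypothetical placeholders, and the sketch uses a Euclidean mirror map (projected dual gradient ascent), not necessarily the paper's exact choice.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                 # horizon
rho = 0.5                # hypothetical per-period budget (total budget / T)
eta = 1.0 / np.sqrt(T)   # dual step size, matching the O(sqrt(T)) scaling

def sample_context():
    # Toy context: 3 candidate actions, columns = (revenue, cost).
    return rng.uniform(0.0, 1.0, size=(3, 2))

lam = 0.0                # dual variable for the cost constraint
total_rev, total_cost = 0.0, 0.0
for t in range(T):
    ctx = sample_context()
    rev, cost = ctx[:, 0], ctx[:, 1]
    # Primal step: maximize Lagrangian revenue minus priced cost.
    a = int(np.argmax(rev - lam * cost))
    total_rev += rev[a]
    total_cost += cost[a]
    # Dual step: ascend on the constraint slack, projected onto lam >= 0
    # (an entropic mirror map would give multiplicative updates instead).
    lam = max(0.0, lam + eta * (cost[a] - rho))

avg_cost = total_cost / T   # the dual price steers this toward rho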
“…The authors in Badanidiyuru et al. (2018) introduce Bandits with Knapsacks, which combines online learning with integer programming for learning under constraints. This model has been extended to various other settings, such as linear contextual bandits (Agrawal and Devanur, 2016), combinatorial semi-bandits (Abinav and Slivkins, 2018), the adversarial setting (Immorlica et al., 2019), and cascading bandits (Zhou et al., 2018). The authors in Combes et al. (2015) establish lower bounds for budgeted bandits and develop algorithms with matching upper bounds.…”
We consider a continuous time multi-armed bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtain a random reward from each sample; however, increasing the frequency of sampling incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining large reward and incurring sampling cost as a function of the sampling frequency. The goal is to design a learning algorithm that minimizes the regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm. CTMAB is fundamentally different from the usual multi-armed bandit problem (MAB); e.g., even the single-arm case is non-trivial in CTMAB, since the optimal sampling frequency depends on the mean of the arm, which needs to be estimated. We first establish lower bounds on the regret achievable with any algorithm, and then propose algorithms that achieve the lower bound up to logarithmic factors. For the single-arm case, we show that the lower bound on the regret is Ω((log T)^2/µ), where µ is the mean of the arm, and T is the time horizon. For the multiple-arms case, we show that the lower bound on the regret is Ω((log T)^2 µ/∆^2), where µ now represents the mean of the best arm, and ∆ is the difference between the means of the best and the second-best arm. We then propose an algorithm that achieves the bound up to constant terms.
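The single-arm tradeoff described above can be illustrated with a deliberately simplified payoff model. The quadratic sampling-cost form below (payoff rate f·µ − c·f², maximized at f* = µ/(2c)) is a hypothetical stand-in, not the paper's model; it only shows why the optimal frequency depends on the unknown mean µ and why a plug-in estimate of µ drives the regret.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_true = 0.6   # unknown arm mean (Bernoulli rewards); toy value
c = 0.5         # hypothetical quadratic sampling-cost coefficient

def opt_freq(mu):
    # Maximizer of the toy payoff rate f * mu - c * f**2.
    return mu / (2 * c)

# Oracle payoff rate: plug the true mean into the optimal frequency.
f_star = opt_freq(mu_true)
oracle_rate = f_star * mu_true - c * f_star**2

# Plug-in learner: estimate mu from samples so far, re-tune the frequency.
n, s = 0, 0.0
learner_payoff = 0.0
rounds = 5000
for t in range(rounds):
    mu_hat = s / n if n > 0 else 1.0          # optimistic initial estimate
    f = opt_freq(mu_hat)
    r = rng.binomial(1, mu_true)              # draw one sample this round
    n += 1
    s += r
    learner_payoff += f * mu_true - c * f**2  # expected payoff rate at freq f

regret = rounds * oracle_rate - learner_payoff
```

Since each round's expected payoff rate is at most `oracle_rate`, the regret is nonnegative, and it stays small because the per-round gap shrinks like the squared estimation error of µ.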
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.