2020
DOI: 10.48550/arxiv.2006.10185
Preprint

Stochastic Bandits with Linear Constraints

Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, et al.

Abstract: We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward over the course of $T$ rounds is maximum, and each of which has an expected cost below a certain threshold $\tau$. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove an $O\!\left(\frac{d\sqrt{T}}{\tau - c_0}\right)$ bound on its $T$-round regret, where the denominator is the difference between the constraint threshold and the c…
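For concreteness, the objective and the regret guarantee stated in the abstract can be written roughly as follows; the shorthand $r_t$, $c_t$, and $\pi_t^\ast$ is ours, and the exact definitions, including that of $c_0$, follow the full text of the paper:

\max_{\pi_1,\dots,\pi_T} \; \sum_{t=1}^{T} \mathbb{E}\big[r_t(\pi_t)\big]
\quad \text{subject to} \quad \mathbb{E}\big[c_t(\pi_t)\big] \le \tau \ \text{ for all } t \in [T],

R_T \;=\; \sum_{t=1}^{T} \mathbb{E}\big[r_t(\pi_t^\ast) - r_t(\pi_t)\big] \;=\; O\!\left(\frac{d\sqrt{T}}{\tau - c_0}\right).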

Cited by 2 publications (20 citation statements)
References 14 publications (22 reference statements)
“…Since the computational complexity of the dual component depends on the number of constraints, but is independent of sizes of the contextual space, the action space, and the feature space, the overall computational complexity of our algorithm is similar to that of LinUCB in the unconstrained setting. Since our anytime cumulative constraint (2) is most similar to an anytime policy constraint in [31], we first compare our algorithm with OPLB proposed in [31]. OPLB needs to construct a safe policy set in each round.…”
Section: Main Contributions (mentioning)
confidence: 99%
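The LinUCB baseline referenced in this comparison is the standard unconstrained linear bandit algorithm; a minimal sketch of one round of it is given below for reference. The class name and the parameters alpha and lam are illustrative assumptions for this sketch, not taken from either paper.

import numpy as np

class LinUCB:
    # Minimal disjoint-model LinUCB for K arms with d-dimensional features.
    def __init__(self, d, alpha=1.0, lam=1.0):
        self.alpha = alpha            # exploration weight (assumed value)
        self.A = lam * np.eye(d)      # ridge-regularized design matrix
        self.b = np.zeros(d)          # running sum of reward-weighted features

    def select(self, arms):
        # arms: (K, d) array of feature vectors; returns the UCB-maximizing index.
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b        # ridge estimate of the reward parameter
        bonus = np.sqrt(np.sum((arms @ A_inv) * arms, axis=1))
        return int(np.argmax(arms @ theta + self.alpha * bonus))

    def update(self, x, r):
        # Rank-one update with the pulled arm's features x and observed reward r.
        self.A += np.outer(x, x)
        self.b += r * x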
“…it imposes a cumulative constraint in every round. This anytime cumulative constraint is most similar to an anytime policy constraint in [31] because the average cost of a policy is close to its mean after the policy has been applied for many rounds and the process converges, so can be viewed as a cumulative constraint on actions over many rounds (like ours). Furthermore, when our anytime cumulative constraint (2) is satisfied, our learner guarantees that the time-average cost is below a threshold in every round.…”
Section: Introduction (mentioning)
confidence: 97%
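As a concrete reading of the anytime cumulative constraint described above (after every round, the time-average cost so far must stay below the threshold), a minimal sketch follows. The names costs and tau are assumptions for illustration, not notation from either paper.

import numpy as np

def satisfies_anytime_constraint(costs, tau):
    # costs: realized per-round costs c_1, ..., c_t observed so far.
    # The constraint holds iff the running average (1/t) * sum_{s<=t} c_s
    # stays at or below the threshold tau after every round.
    running_avg = np.cumsum(costs) / np.arange(1, len(costs) + 1)
    return bool(np.all(running_avg <= tau))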