2019
DOI: 10.48550/arxiv.1911.02156
Preprint

Safe Linear Thompson Sampling with Side Information

Abstract: The design and performance analysis of bandit algorithms in the presence of stage-wise safety or reliability constraints has recently garnered significant interest. In this work, we consider the linear stochastic bandit problem under additional linear safety constraints that need to be satisfied at each round. We provide a new safe algorithm based on linear Thompson Sampling (TS) for this problem and show a frequentist regret of order O(d^{3/2} log^{1/2}(d) · T^{1/2} log^{3/2}(T)), which remarkably matches the results …
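
The abstract describes a per-round linear safety constraint layered on top of linear Thompson Sampling. Below is a minimal sketch of one such round, assuming a finite action set, a Gaussian posterior sample for the reward parameter, and a conservative upper-confidence estimate of the linear cost used to screen out potentially unsafe actions. The variable names, confidence width, thresholds, and fallback action are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                        # feature dimension
tau = 0.5                                    # safety threshold on the linear cost (assumed)
actions = rng.uniform(-1, 1, size=(50, d))   # finite action set (hypothetical)

# Regularized least-squares statistics shared by the reward (theta) and cost (mu) estimates.
lam = 1.0
V = lam * np.eye(d)
b_reward = np.zeros(d)
b_cost = np.zeros(d)
beta = 0.5                                   # confidence width (set by theory in the paper)

# Hypothetical true parameters, used only to simulate feedback.
theta_star = np.array([0.5, 0.2, -0.1])
mu_star = np.array([0.1, 0.3, 0.05])

for t in range(100):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b_reward
    mu_hat = V_inv @ b_cost

    # Thompson sample for the reward parameter.
    cov = (V_inv + V_inv.T) / 2              # symmetrize for numerical stability
    theta_tilde = rng.multivariate_normal(theta_hat, cov)

    # Conservative cost estimate per action: point estimate plus a confidence term,
    # so the retained actions are plausibly safe.
    widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
    cost_ucb = actions @ mu_hat + beta * widths
    safe = cost_ucb <= tau

    if safe.any():
        # Among estimated-safe actions, play the one maximizing the sampled reward.
        candidates = actions[safe]
        x_t = candidates[np.argmax(candidates @ theta_tilde)]
    else:
        # Fall back to a known safe action (the setting assumes one exists).
        x_t = np.zeros(d)

    # Noisy linear feedback for reward and cost.
    reward = x_t @ theta_star + 0.1 * rng.standard_normal()
    cost = x_t @ mu_star + 0.1 * rng.standard_normal()

    # Rank-one updates of the shared statistics.
    V += np.outer(x_t, x_t)
    b_reward += reward * x_t
    b_cost += cost * x_t
```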

Cited by 6 publications (21 citation statements)
References 9 publications
“…Compared to the previous setting, our constraint is more relaxed (from high-probability to expectation), and as a result, it would be possible for us to obtain a solution with larger expected cumulative reward. We will have a detailed discussion on the relationship between these two settings and the similarities and differences of our results with those reported in Amani et al [2019] and Moradipari et al [2019] in Section 7.…”
Section: Introduction (supporting)
confidence: 71%
“…In Figure 3, the reason that the cost evolution of OPB is the same as that of the optimal policy (middle) is that in this case, the cost of the best arm (arm 4) is equal to the constraint threshold τ = 0.2. As described in Section 1, our setting is the closest to the one studied by Amani et al [2019] and Moradipari et al [2019]. They study a slightly different setting, in which the mean cost of the action that the agent takes should satisfy the constraint, i.e., ⟨x_t, µ*⟩ ≤ τ, not the mean cost of the policy it computes, i.e., ⟨x_{π_t}, µ*⟩ ≤ τ, as in our case.…”
Section: Methods (mentioning)
confidence: 99%
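
The distinction quoted above, between requiring the played action's mean cost to satisfy the constraint and requiring only the computed policy's mean cost to do so, can be made concrete with a small numerical example; the arm features, cost parameter, and threshold below are made up for illustration.

```python
import numpy as np

mu_star = np.array([0.1, 0.3])   # unknown true cost parameter (hypothetical)
tau = 0.2                        # constraint threshold

arms = np.array([[1.0, 0.0],     # true cost 0.1: safe on its own
                 [0.0, 1.0]])    # true cost 0.3: unsafe on its own
pi = np.array([0.5, 0.5])        # a mixed policy over the two arms

per_action_costs = arms @ mu_star        # [0.1, 0.3]
policy_cost = pi @ per_action_costs      # 0.2

# Per-action constraint <x_t, mu*> <= tau: violated whenever arm 2 is played.
print(np.all(per_action_costs <= tau))   # False

# Per-policy constraint <x_pi, mu*> <= tau: satisfied by the mixture.
print(policy_cost <= tau)                # True
```
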
“…One setting, referred to as conservative bandits [31,17,12], requires the cumulative reward to remain above a fixed percentage of the cumulative reward of a given baseline policy. Another setting is where each arm is associated with two unknown distributions (similar to our setting), generating reward and cost signals respectively [3,23,20,21].…”
Section: Related Work (mentioning)
confidence: 99%
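
The conservative-bandit condition mentioned in the quote above, that cumulative reward must stay above a fixed fraction of a given baseline's cumulative reward at every round, can be checked directly; the reward sequences and the fraction alpha below are hypothetical.

```python
import numpy as np

alpha = 0.1                                        # allowed fractional loss vs. the baseline
agent_rewards = np.array([0.5, 0.6, 0.4, 0.7])     # hypothetical per-round agent rewards
baseline_rewards = np.array([0.5, 0.5, 0.5, 0.5])  # hypothetical baseline rewards

agent_cum = np.cumsum(agent_rewards)
baseline_cum = np.cumsum(baseline_rewards)

# The conservative constraint must hold at every round t:
#   sum_{s<=t} r_s >= (1 - alpha) * sum_{s<=t} r_s^baseline
print(np.all(agent_cum >= (1 - alpha) * baseline_cum))   # True for these values
```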