2019
DOI: 10.48550/arxiv.1911.02156
Preprint

Safe Linear Thompson Sampling with Side Information

Abstract: The design and performance analysis of bandit algorithms in the presence of stage-wise safety or reliability constraints has recently garnered significant interest. In this work, we consider the linear stochastic bandit problem under additional linear safety constraints that need to be satisfied at each round. We provide a new safe algorithm based on linear Thompson Sampling (TS) for this problem and show a frequentist regret of order O(d^{3/2} log^{1/2}(d) · T^{1/2} log^{3/2}(T)), which remarkably matches the results …
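
The abstract describes a per-round linear safety constraint layered on top of linear Thompson Sampling. Below is a minimal sketch of one such round, assuming a finite action set, a Gaussian posterior sample for the reward parameter, and a conservative upper-confidence estimate of the linear cost used to screen out potentially unsafe actions. The variable names, confidence width, thresholds, and fallback action are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                        # feature dimension
tau = 0.5                                    # safety threshold on the linear cost (assumed)
actions = rng.uniform(-1, 1, size=(50, d))   # finite action set (hypothetical)

# Regularized least-squares statistics shared by the reward (theta) and cost (mu) estimates.
lam = 1.0
V = lam * np.eye(d)
b_reward = np.zeros(d)
b_cost = np.zeros(d)
beta = 0.5                                   # confidence width (set by theory in the paper)

# Hypothetical true parameters, used only to simulate feedback.
theta_star = np.array([0.5, 0.2, -0.1])
mu_star = np.array([0.1, 0.3, 0.05])

for t in range(100):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b_reward
    mu_hat = V_inv @ b_cost

    # Thompson sample for the reward parameter.
    cov = (V_inv + V_inv.T) / 2              # symmetrize for numerical stability
    theta_tilde = rng.multivariate_normal(theta_hat, cov)

    # Conservative cost estimate per action: point estimate plus a confidence term,
    # so the retained actions are plausibly safe.
    widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
    cost_ucb = actions @ mu_hat + beta * widths
    safe = cost_ucb <= tau

    if safe.any():
        # Among estimated-safe actions, play the one maximizing the sampled reward.
        candidates = actions[safe]
        x_t = candidates[np.argmax(candidates @ theta_tilde)]
    else:
        # Fall back to a known safe action (the setting assumes one exists).
        x_t = np.zeros(d)

    # Noisy linear feedback for reward and cost.
    reward = x_t @ theta_star + 0.1 * rng.standard_normal()
    cost = x_t @ mu_star + 0.1 * rng.standard_normal()

    # Rank-one updates of the shared statistics.
    V += np.outer(x_t, x_t)
    b_reward += reward * x_t
    b_cost += cost * x_t
```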

Cited by 6 publications (21 citation statements)
References 9 publications
“…Compared to the previous setting, our constraint is more relaxed (from high-probability to expectation), and as a result, it would be possible for us to obtain a solution with larger expected cumulative reward. We will have a detailed discussion on the relationship between these two settings and the similarities and differences of our results with those reported in Amani et al [2019] and Moradipari et al [2019] in Section 7.…”
Section: Introduction (supporting)
confidence: 71%
“…In Figure 3, the reason that the cost evolution of OPB is the same as that of the optimal policy (middle) is that in this case, the cost of the best arm (arm 4) is equal to the constraint threshold τ = 0.2. As described in Section 1, our setting is the closest to the one studied by Amani et al [2019] and Moradipari et al [2019]. They study a slightly different setting, in which the mean cost of the action that the agent takes should satisfy the constraint, i.e., ⟨x_t, µ*⟩ ≤ τ, not the mean cost of the policy it computes, i.e., ⟨x_{π_t}, µ*⟩ ≤ τ, as in our case.…”
Section: Methods (mentioning)
confidence: 99%
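
The distinction quoted above, between requiring the played action's mean cost to satisfy the constraint and requiring only the computed policy's mean cost to do so, can be made concrete with a small numerical example; the arm features, cost parameter, and threshold below are made up for illustration.

```python
import numpy as np

mu_star = np.array([0.1, 0.3])   # unknown true cost parameter (hypothetical)
tau = 0.2                        # constraint threshold

arms = np.array([[1.0, 0.0],     # true cost 0.1: safe on its own
                 [0.0, 1.0]])    # true cost 0.3: unsafe on its own
pi = np.array([0.5, 0.5])        # a mixed policy over the two arms

per_action_costs = arms @ mu_star        # [0.1, 0.3]
policy_cost = pi @ per_action_costs      # 0.2

# Per-action constraint <x_t, mu*> <= tau: violated whenever arm 2 is played.
print(np.all(per_action_costs <= tau))   # False

# Per-policy constraint <x_pi, mu*> <= tau: satisfied by the mixture.
print(policy_cost <= tau)                # True
```
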
“…One setting, referred to as conservative bandits [31,17,12], requires the cumulative reward to remain above a fixed percentage of the cumulative reward of a given baseline policy. Another setting is where each arm is associated with two unknown distributions (similar to our setting), generating reward and cost signals respectively [3,23,20,21].…”
Section: Related Work (mentioning)
confidence: 99%
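
The conservative-bandit condition mentioned in the quote above, that cumulative reward must stay above a fixed fraction of a given baseline's cumulative reward at every round, can be checked directly; the reward sequences and the fraction alpha below are hypothetical.

```python
import numpy as np

alpha = 0.1                                        # allowed fractional loss vs. the baseline
agent_rewards = np.array([0.5, 0.6, 0.4, 0.7])     # hypothetical per-round agent rewards
baseline_rewards = np.array([0.5, 0.5, 0.5, 0.5])  # hypothetical baseline rewards

agent_cum = np.cumsum(agent_rewards)
baseline_cum = np.cumsum(baseline_rewards)

# The conservative constraint must hold at every round t:
#   sum_{s<=t} r_s >= (1 - alpha) * sum_{s<=t} r_s^baseline
print(np.all(agent_cum >= (1 - alpha) * baseline_cum))   # True for these values
```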