Safe Linear Thompson Sampling With Side Information
2021
DOI: 10.1109/tsp.2021.3089822

Cited by 10 publications (12 citation statements)
References 24 publications
“…We verify the theoretical study above with simulations over Example 5.6, and study the relative performance of DOCLB and the optimistic-pessimistic method Safe-LTS [MAAT21]. These implementations are based on the following relaxation of Algorithm 1.…”
Section: Simulations (confidence: 85%)
“…In contrast, Pacchiano et al (2021); Amani et al (2019); Wu et al (2016) all use optimistic-pessimistic methods, which instead maintain upper bounds on both the rewards and safety risk and play the actions with maximum reward upper bound whilst being safe with respect to the stringent risk upper bounds. Moradipari et al (2021) take a similar pessimistic approach, but replace the reward upper bounds with a Thompson sampling procedure that is similar in spirit to our Alg. 2, although this uses optimistic safety indices.…”
Section: Methodological Approaches (confidence: 99%)
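The optimistic-pessimistic pattern described in the passage above (maintain upper bounds on both reward and safety risk, play the highest-reward-upper-bound action that is safe with respect to the stringent risk upper bounds, and fall back to a known safe action otherwise) can be sketched as a one-round selection rule. All names and numbers here are illustrative, not taken from any of the cited papers:

```python
# Minimal sketch of an optimistic-pessimistic action-selection step.
# reward_ucb[i] is an optimistic (upper) confidence bound on arm i's reward;
# risk_ucb[i] is a pessimistic (upper) bound on arm i's safety risk.
def select_safe_arm(reward_ucb, risk_ucb, budget, safe_arm):
    """Play the reward-optimistic arm among those certified safe;
    fall back to the known safe arm if nothing is certified."""
    feasible = [i for i, r in enumerate(risk_ucb) if r <= budget]
    if not feasible:                 # no arm provably within the risk budget
        return safe_arm
    return max(feasible, key=lambda i: reward_ucb[i])

# Arm 0 is reward-optimistic but its risk upper bound (0.8) exceeds the
# budget, so the rule picks arm 1, the best arm among the certified-safe set.
choice = select_safe_arm([0.9, 0.7, 0.5], [0.8, 0.3, 0.1], budget=0.5, safe_arm=2)
```

Replacing `reward_ucb` with a Thompson sample of the reward parameter, while keeping the pessimistic `risk_ucb` filter, gives the hybrid approach the passage attributes to Moradipari et al (2021).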
“…These papers also study hard round-wise safety constraints, and again utilise a known safe action, as well as the continuity of the action space, to enable sufficient exploration. We note that the particulars of the signalling model adopted by Amani et al (2019) preclude extending their results to the multi-armed setting, and while the model of Moradipari et al (2021) does admit such extension, the scheme proposed fundamentally relies on having a continuous action space with a linear safety-risk, and cannot be extended to multi-armed settings without lifting to policy space.…”
Section: Per-round Constraints (confidence: 97%)
“…Two well-known algorithms for LB are linear UCB (LinUCB) and linear Thompson Sampling (LinTS). [8] provided a regret bound of order O(√T log T) for LinUCB, and [9], [10], [11], and [12] provided a regret bound of order O(√T (log T)^{3/2}) for LinTS in a frequentist setting, where the unknown reward parameter θ is fixed.…”
Section: Related Work (confidence: 99%)
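The quoted passage contrasts LinUCB with frequentist LinTS (fixed unknown θ, sampled estimates). A minimal LinTS loop can be sketched as follows; the toy instance, dimensions, and the inflated sampling covariance are illustrative assumptions, not the constructions analysed in [9]–[12]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: d-dimensional linear bandit with K fixed arms,
# Gaussian reward noise, and a fixed (frequentist) unknown parameter theta_star.
d, K, T = 3, 10, 2000
theta_star = np.array([1.0, 0.0, 0.0])
arms = rng.normal(size=(K, d))
arms[0] = np.array([1.0, 0.0, 0.0])     # ensure one clearly good arm exists

lam, noise = 1.0, 0.1
v = noise * np.sqrt(d * np.log(T))      # inflated sampling scale, as in
                                        # frequentist LinTS analyses
V = lam * np.eye(d)                     # regularised design matrix
b = np.zeros(d)

total_reward = 0.0
for _ in range(T):
    theta_hat = np.linalg.solve(V, b)   # ridge estimate of theta_star
    cov = v**2 * np.linalg.inv(V)
    # TS step: sample a perturbed parameter, then act greedily w.r.t. it.
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    x = arms[int(np.argmax(arms @ theta_tilde))]
    r = float(x @ theta_star) + noise * rng.normal()
    V += np.outer(x, x)
    b += r * x
    total_reward += r

best = float(np.max(arms @ theta_star))
regret = T * best - total_reward        # cumulative regret vs. the best arm
```

The √(d log T) inflation of the sampling covariance is what distinguishes frequentist LinTS from a plain Bayesian posterior draw and is the source of the extra (log T)^{1/2} factor in its regret bound relative to LinUCB.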