2022
DOI: 10.48550/arxiv.2204.00706
Preprint

Strategies for Safe Multi-Armed Bandits with Logarithmic Regret and Risk

Abstract: We investigate a natural but surprisingly unstudied approach to the multi-armed bandit problem under safety risk constraints. Each arm is associated with an unknown law on safety risks and rewards, and the learner's goal is to maximise reward whilst not playing unsafe arms, as determined by a given threshold on the mean risk. We formulate a pseudo-regret for this setting that enforces this safety constraint in a per-round way by softly penalising any violation, regardless of the gain in reward due to the same. …
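The per-round, softly penalised pseudo-regret described in the abstract can be sketched numerically. This is a minimal illustration, not the paper's exact definition: the function name and the `penalty` weight are hypothetical, and the comparator (best mean reward among safe arms) is assumed from the problem description.

```python
import numpy as np

def safety_penalised_pseudo_regret(mean_rewards, mean_risks, plays,
                                   threshold, penalty=1.0):
    """Sketch of a per-round risk-penalised pseudo-regret.

    Every play of an arm whose mean risk exceeds the threshold is
    softly penalised, regardless of any reward gained by that play.

    mean_rewards, mean_risks : per-arm means (unknown to the learner)
    plays     : sequence of arm indices chosen over the horizon
    threshold : safety threshold on the mean risk
    penalty   : weight on per-round violations (hypothetical knob)
    """
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    mean_risks = np.asarray(mean_risks, dtype=float)
    safe = mean_risks <= threshold
    # The comparator is the best mean reward among the safe arms.
    best_safe_reward = mean_rewards[safe].max()
    regret = 0.0
    for arm in plays:
        # Reward shortfall relative to the best safe arm ...
        regret += best_safe_reward - mean_rewards[arm]
        # ... plus a soft per-round penalty for any risk violation.
        regret += penalty * max(0.0, mean_risks[arm] - threshold)
    return regret
```

With two arms (safe arm: reward 1.0, risk 0.1; unsafe arm: reward 2.0, risk 0.9) and threshold 0.5, playing the unsafe arm gains reward but incurs a penalty in every such round, so for a large enough `penalty` the unsafe plays dominate the pseudo-regret.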

Cited by 1 publication (8 citation statements)
References 13 publications
“…The bound above behaves inversely with respect to the three gaps, as well as with respect to ε. The following lower bound argues, via a reduction to the safe multi-armed bandit problem [CGS22], that the dependence on min(Δ_I, Γ_I) in the above is tight for consistent algorithms (§D.5).…”
Section: Note
confidence: 99%
“…Here, each constraint is associated with a notion of regret S^i_T = Σ_{t≤T} (⟨a_i, x_t⟩ − α_i), and the overall regret is measured as max((max_i S^i_T), Σ_{t≤T} ⟨θ, x* − x_t⟩). As noted by Pacchiano et al. [PGBJ21] and Chen et al. [CGS22], the main disadvantage of this formulation from our perspective arises from the fact that constraint violations are aggregated. This functionally means that it is acceptable for effective algorithms to alternate between actions that have large reward and poor safety, and actions that have poor reward but good safety.…”
Section: A. A More Detailed Look at Related Work
confidence: 99%
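The aggregation issue raised in the citation statement above can be made concrete with a small numeric sketch. This is an illustration under assumed values, not the formulation of any cited paper: an aggregated violation measure sums signed excess risk over all rounds, so safe rounds cancel unsafe ones, whereas a per-round measure charges each violating round separately.

```python
import numpy as np

def aggregated_violation(risks, threshold):
    """Aggregated constraint regret: signed excess risk summed over all
    rounds, so under-threshold rounds can offset over-threshold ones."""
    return float(np.sum(np.asarray(risks) - threshold))

def per_round_violation(risks, threshold):
    """Per-round accounting: each violating round is penalised on its
    own, so safe rounds cannot offset unsafe ones."""
    return float(np.sum(np.maximum(np.asarray(risks) - threshold, 0.0)))

# Alternating between a very unsafe action (risk 1.0) and a very safe
# one (risk 0.0) with threshold 0.5: aggregation reports no violation
# at all, while per-round accounting charges every unsafe play.
risks = [1.0, 0.0] * 5
print(aggregated_violation(risks, 0.5))  # 0.0
print(per_round_violation(risks, 0.5))   # 2.5
```

This is exactly the alternation pathology the quoted passage describes: an algorithm that flips between a high-reward unsafe action and a low-reward safe action looks perfectly safe under the aggregated measure.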