Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence 2021
DOI: 10.24963/ijcai.2021/75

Addressing the Long-term Impact of ML Decisions via Policy Regret

Abstract: Machine Learning (ML) increasingly informs the allocation of opportunities to individuals and communities in areas such as lending, education, employment, and beyond. Such decisions often impact their subjects' future characteristics and capabilities in an a priori unknown fashion. The decision-maker, therefore, faces exploration-exploitation dilemmas akin to those in multi-armed bandits. Following prior work, we model communities as arms. To capture the long-term effects of ML-based allocation decisions, we …
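
As a rough, self-contained illustration of the setting sketched in the abstract (the arm count, reward curves, and constants below are hypothetical and not taken from the paper), the Python sketch that follows treats communities as bandit arms whose expected reward depends on how often they have been allocated resources so far, and compares a myopic allocation rule against the best fixed arm replayed on its own counterfactual pull history; the gap between the two is the kind of quantity that policy regret measures.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ARMS = 3    # communities competing for the allocation (hypothetical)
HORIZON = 200   # number of allocation rounds (hypothetical)

def expected_reward(arm: int, pulls_so_far: int) -> float:
    """Toy history-dependent means: past allocations change an arm's future payoff."""
    base = [0.2, 0.4, 0.3][arm]
    growth = [0.004, 0.001, 0.003][arm]   # how much each past allocation helps this arm
    return min(1.0, base + growth * pulls_so_far)

def run(policy) -> float:
    """Play `policy` for HORIZON rounds and return its cumulative reward."""
    pulls = np.zeros(NUM_ARMS, dtype=int)
    total = 0.0
    for t in range(HORIZON):
        arm = policy(t, pulls)
        total += rng.binomial(1, expected_reward(arm, pulls[arm]))
        pulls[arm] += 1
    return total

# Myopic rule: always allocate to the arm with the best *current* mean.
myopic = run(lambda t, pulls: int(np.argmax(
    [expected_reward(a, pulls[a]) for a in range(NUM_ARMS)])))

# Best fixed arm, replayed on its own counterfactual history of pulls.
best_fixed = max(run(lambda t, pulls, a=a: a) for a in range(NUM_ARMS))

# Policy regret compares against that counterfactual, so it penalizes the
# myopic rule for never letting slower-starting communities build up.
print(f"myopic reward: {myopic:.0f}, best fixed arm: {best_fixed:.0f}, "
      f"policy regret of myopic rule: {best_fixed - myopic:.0f}")
```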

Cited by 4 publications (5 citation statements). References 19 publications.
“…Thus far, our motivation for the tallying bandit setting has been primarily theoretical, to resolve the gap in our understanding of when we can efficiently minimize CPR. Nevertheless, in similar vein to Heidari et al. [HKR16], Lindner et al. [LHK21] and Awasthi et al. [ABGK22], we believe that the tallying bandit is a simple approximation for various practical settings. For instance, in recommender systems the reward associated with an action is rarely static, because the stimulus of recommended content influences user preferences [CLA+03, SGR16].…”
Section: Tallying Bandits (supporting)
confidence: 82%
“…A natural restriction is to enforce that each g_x has special structure. This is precisely the approach taken by works on rotting bandits [HKR16, LCM17, SLC+19, SMLV20], improving bandits [HKR16], single-peaked bandits [LHK21] and congested bandits [ABGK22]. Concretely, these works use base functions {g_x}_{x ∈ X} that have the following special "tallying" structure.…”
Section: Restricting the Adversary (mentioning)
confidence: 99%
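
The excerpt above refers to a special "tallying" structure on the base functions g_x but does not spell it out; the sketch below is an assumed rendering for illustration only (the window length and base functions are hypothetical): the expected reward of an action x is g_x evaluated at a tally of how often x was played within a recent window, which reduces to the total-pull-count dependence of rotting and improving bandits when the window spans the whole history.

```python
# Memory window for the tally; with M equal to the horizon the tally is just
# the total number of past pulls, as in rotting / improving bandits.
M = 50  # hypothetical value

def tally_reward(g, history, x, t, m=M):
    """Expected reward of action x at round t under a tallying base function g."""
    recent = history[max(0, t - m):t]      # actions played in the last m rounds
    n = sum(1 for a in recent if a == x)   # tally: how often x appears in that window
    return g(n)

# Example base functions: a "rotting" arm degrades with repeated use, while an
# "improving" arm gains from it; both fit the tallying form above.
def g_rotting(n):
    return max(0.0, 0.9 - 0.02 * n)

def g_improving(n):
    return min(1.0, 0.2 + 0.02 * n)

history = [0, 1, 0, 0, 1]                  # actions played so far
print(tally_reward(g_rotting, history, x=0, t=len(history)))    # ≈ 0.84
print(tally_reward(g_improving, history, x=1, t=len(history)))  # ≈ 0.24
```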