Linear bandits with limited adaptivity and learning distributional optimal design

Ruan, Yongle; Yang, Jiaqi; Zhou, Yuan

doi:10.1145/3406325.3451004

Cited by 9 publications

(14 citation statements)

References 15 publications

“…Bandits with limited adaptivity complexity. There is a lot of interest in obtaining bandit algorithms that update their policies rarely (Abbasi-Yadkori et al 2011, Perchet et al 2016, Gao et al 2019, Dong et al 2020, Chen et al 2020, Ruan et al 2021). Notably, Dong et al (2020) study rare policy switching constraints for a broader class of online learning and decision making problems such as logit bandits.…”

Section: Related Work (mentioning)

confidence: 99%

The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches

Grigoris, Yang, Karbasi. 2022. Preprint.


In this paper, we study the problem of regret minimization for episodic Reinforcement Learning (RL) both in the model-free and the model-based setting. We focus on learning with general function classes and general model classes, and we derive results that scale with the eluder dimension of these classes. In contrast to the existing body of work that mainly establishes instance-independent regret guarantees, we focus on the instance-dependent setting and show that the regret scales logarithmically with the horizon T, provided that there is a gap between the best and the second best action in every state. In addition, we show that such a logarithmic regret bound is realizable by algorithms with O(log T) switching cost (also known as adaptivity complexity). In other words, these algorithms rarely switch their policy during the course of their execution. Finally, we complement our results with lower bounds which show that even in the tabular setting, we cannot hope for regret guarantees lower than o(log T).

“…We prove that while the communication cost of DisBE-LUCB is only Õ(dN), it achieves a regret Õ(√(dNT)), which is of the same order as that incurred by a near optimal single-agent algorithm for NT rounds. We highlight that while DisBE-LUCB is inspired by the single-agent BatchLinUCB-DG proposed in [Ruan et al, 2021] in an attempt to save on communication as much as possible, a direct use of confidence set in [Ruan et al, 2021] would fail to guarantee optimal communication cost Õ(dN) and require more communication for each agent. We address this issue by introducing a new confidence set in DisBE-LUCB and Lemma 1 as our first main technical contribution.…”

Section: Contributions (mentioning)

confidence: 99%

“…An important line of work related to communication efficiency in distributed bandits studies practical single agent scenarios using batch elimination methods, in which a very small number of batches may achieve minimax optimal learning performance, and therefore it is possible to enjoy the benefits of both adaptivity and parallelism [Ruan et al, 2021, Han et al, 2020, Gao et al, 2019]. Our proposed algorithms are inspired by the single-agent BatchLinUCB-DG proposed in [Ruan et al, 2021] in an attempt to save on communication as much as possible. That said, a direct use and analysis of confidence set in [Ruan et al, 2021] would fail to guarantee optimal communication cost Õ(dN) and require more communication for each agent.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our proposed algorithms are inspired by the single-agent BatchLinUCB-DG proposed in [Ruan et al, 2021] in an attempt to save on communication as much as possible. That said, a direct use and analysis of confidence set in [Ruan et al, 2021] would fail to guarantee optimal communication cost Õ(dN) and require more communication for each agent. We address this issue by introducing a new confidence set, used in DisBE-LUCB, in Lemma 1.…”

Section: Related Work (mentioning)

confidence: 99%

“…In many practical single agent scenarios, where the agent sequentially makes active queries about the environment, it is desirable to limit these queries to a small number of rounds of interaction, which helps to increase the parallelism of the learning process and reduce the management cost. In recent years, to address such scenarios, a surge of research activity in the area of batch online learning has shown that in many popular online learning tasks, a very small number of batches may achieve minimax optimal learning performance, and therefore it is possible to enjoy the benefits of both adaptivity and parallelism [Ruan et al, 2021, Han et al, 2020, Gao et al, 2019]. As such, a careful use of batch learning methods in multi-agent learning scenarios may positively affect the communication efficiency by limiting the number of necessary communication rounds.…”

Section: Introduction (mentioning)

confidence: 99%


Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost

Amani, Lattimore, György et al. 2022. Preprint.


We study distributed contextual linear bandits with stochastic contexts, where N agents act cooperatively to solve a linear bandit-optimization problem with d-dimensional features. For this problem, we propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that over T rounds (NT actions in total) the communication cost of DisBE-LUCB is only Õ(dN) and its regret is at most Õ(√(dNT)), which is of the same order as that incurred by an optimal single-agent algorithm for NT rounds. Remarkably, we derive an information-theoretic lower bound on the communication cost of the distributed contextual linear bandit problem with stochastic contexts, and prove that our proposed algorithm is nearly minimax optimal in terms of both regret and communication cost. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their immediate neighbors through a carefully designed consensus procedure.


Active learning for data streams: a survey

Cacciarelli, Kulahci. 2023. Mach Learn.


Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in the last decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in real time. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research.
