Power-of-2-arms for bandit learning with switching costs

Shi, Ming; Lin, Xiaojun; Jiao, Lei

doi:10.1145/3492866.3549720

Cited by 5 publications

(20 citation statements)

References 4 publications


“…Specifically, we divide the state space S and construct special state transitions, such that the episodic reinforcement learning is reduced to Θ(S/H) chains of bandit learning. Notice that the lower-bound analysis in [21] implies that, with the loss function l_t upper-bounded by H, and with A arms and T time-slots, the regret of any bandit-learning algorithm with switching costs is at least Ω(β^{1/3} A^{1/3} (HT)^{2/3}) when T ≥ max{6H^2 A, β}. Hence, the total regret from all Θ(S/H) chains of bandit learning…”

Section: Discussion (mentioning)

confidence: 99%

“…As we discussed above, the idea for reducing switching in static RL does not work well here. To handle the losses that can change arbitrarily, our design is inspired by the approach in [21] for bandit learning, but with two novel ideas. (a) We delay each switch by a fixed (but tunable) number of episodes, which ensures that switch occurs only every Õ(T^{1/3}) episodes.…”

Section: Our Contributions (mentioning)

confidence: 99%

“…Switching costs have also been studied in metrical task systems [33], online set covering [34], k-server problem [35], online control [36,37,38], etc. Moreover, switching costs have been studied in adversarial bandit learning, e.g., in [18,19,20,21]. Our work in this paper can be viewed as a non-trivial generalization of these studies on bandit learning to adversarial MDP, where state transitions and multiple layers in each episode require new developments in both the algorithm design and regret analysis.…”

Section: Switching Costs (mentioning)

confidence: 99%

“…SEEDS is inspired by the episodic method in bandit learning [21]. In bandit learning, the idea is to divide the time horizon into Θ(T^{2/3}) episodes, and pull one single Exp3-arm in an episode.…”

Section: The Case When the Transition Function Is Known (mentioning)

confidence: 99%
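
The episodic idea quoted above (split the horizon into Θ(T^{2/3}) episodes and commit to a single Exp3-sampled arm in each) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [21] or from the citing paper: the function name `batched_exp3`, the `losses[t][a]` input format with losses in [0, 1], and the learning rate `eta` are all hypothetical choices made for this example.

```python
import math
import random


def batched_exp3(losses, num_batches=None, seed=0):
    """Exp3 run over batches of time slots, with one arm held per batch.

    losses[t][a] is the loss in [0, 1] of arm a at slot t (assumed format).
    Returns (total loss incurred, number of arm switches).
    """
    rng = random.Random(seed)
    T, A = len(losses), len(losses[0])
    if num_batches is None:
        # On the order of T^(2/3) batches, as in the quoted description.
        num_batches = max(1, math.ceil(T ** (2 / 3)))
    batch_len = math.ceil(T / num_batches)

    # Illustrative learning rate for Exp3 over num_batches rounds (an assumption).
    eta = math.sqrt(2 * math.log(A) / (num_batches * A)) if A > 1 else 1.0

    cum_est = [0.0] * A  # importance-weighted cumulative loss estimates
    total_loss, switches, prev_arm = 0.0, 0, None

    for start in range(0, T, batch_len):
        # Exponential weights; subtract the minimum for numerical stability.
        base = min(cum_est)
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]

        # Sample one arm and hold it for the whole batch.
        arm = rng.choices(range(A), weights=probs)[0]
        if prev_arm is not None and arm != prev_arm:
            switches += 1
        prev_arm = arm

        block = losses[start:start + batch_len]
        batch_loss = sum(row[arm] for row in block)
        total_loss += batch_loss

        # Only the pulled arm is observed; use the batch-average loss,
        # divided by its sampling probability, as the loss estimate.
        cum_est[arm] += (batch_loss / len(block)) / probs[arm]

    return total_loss, switches
```

Because the arm can change only at batch boundaries, the number of switches is at most the number of batches minus one, which is what keeps the switching cost on the order of T^{2/3} in this sketch.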

“…2 for some examples). Among these studies, a relevant line of research is along bandit learning [18,19,20,21]. More recently, switching costs have received considerable attention in more general RL settings [5,9,22,11].…”

Section: Introduction (mentioning)

confidence: 99%


Near-Optimal Adversarial Reinforcement Learning with Switching Costs

Shi, Liang, Shroff

2023

Preprint


Switching costs, which capture the costs for changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient β that is strictly positive and is independent of T) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models this type of practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower-bound that shows that the regret of any algorithm must be larger than Ω((HSA)^{1/3} T^{2/3}), where T, S, A and H are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on T is Õ(√T)) in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of Õ(H^{1/3}) when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of them.
