2022
DOI: 10.48550/arxiv.2202.06385
Preprint

Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

Abstract: We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\tilde{O}(\sqrt{H^4 S^2 A T})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2 SA$…
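The abstract's key device is a stage-wise schedule that re-plans only at stage boundaries, so the switching cost scales with the number of stages rather than with $T$. As a minimal sketch (not necessarily the paper's exact procedure, whose stage lengths are not visible in this truncated abstract), the snippet below assumes the standard doubly-exponential schedule with stage endpoints near $T^{1-2^{-i}}$, a common way to obtain $O(\log\log T)$ stages:

    import math

    def stage_endpoints(T: int) -> list[int]:
        """Doubly-exponential stage schedule: stage i ends near T^(1 - 2^-i).

        The learner commits to one policy per stage, so the number of policy
        switches scales with the number of stages. Since T^(1 - 2^-k) >= T/2
        as soon as k >= log2(log2(T)), only O(log log T) stages are needed.
        """
        ends, i = [], 1
        while True:
            e = int(T ** (1.0 - 2.0 ** (-i)))
            ends.append(e)
            if e >= T // 2:  # the remaining episodes fit in one final stage
                break
            i += 1
        ends.append(T)  # final stage runs to the horizon
        return ends

    if __name__ == "__main__":
        for T in (10**4, 10**6, 10**9):
            ends = stage_endpoints(T)
            print(f"T={T:>13,}  stages={len(ends):2d}  "
                  f"log2(log2(T))~{math.log2(math.log2(T)):.1f}")

Running the demo shows the stage count tracking $\log_2 \log_2 T$: six stages suffice even for $T = 10^9$.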

Cited by 1 publication (7 citation statements) | References 5 publications
“…Specifically, for tabular MDP, [5] and [39] proposed RL algorithms that attain an $\tilde{O}\big(\sqrt{H^{\alpha} SAT \cdot \ln\frac{TSA}{\delta}}\big)$ regret with probability $1 - \delta$, by incurring $O(H^{\alpha} SA \ln T)$ switching costs, where $\alpha = 3$ and $2$, respectively. Recently, [11] obtained a similar $\tilde{O}(\sqrt{T})$ regret with probability $1 - \delta$, by incurring $O(HSA \ln\ln T)$ switching costs. Moreover, for linear MDP (with $d$-dimensional feature space), [9] and [22]…”
Section: Switching Costs
Mentioning confidence: 92%
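For a sense of why $O(HSA \ln\ln T)$ is called an exponential improvement over $O(HSA \ln T)$: $\ln T$ is exponential in $\ln\ln T$. A quick numeric comparison (constants and the precise bound structure dropped; the sizes $H, S, A$ below are made up purely for illustration):

    import math

    # Hypothetical tabular-MDP sizes, purely for illustration.
    H, S, A = 10, 20, 5

    for T in (10**4, 10**6, 10**9):
        cost_ln = H * S * A * math.log(T)              # O(HSA ln T)-style schedule
        cost_lnln = H * S * A * math.log(math.log(T))  # O(HSA ln ln T)-style schedule
        print(f"T={T:>13,}:  HSA*ln(T) ~ {cost_ln:7.0f}   HSA*ln(ln(T)) ~ {cost_lnln:5.0f}")

Even at $T = 10^9$, $\ln\ln T \approx 3$, while $\ln T \approx 21$ and keeps growing.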
“…Theorem 1 shows that in adversarial RL with switching costs, the dependency on $T$ of the best achievable regret is at least $\Omega(T^{2/3})$. Thus, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (in [5, 11], etc.) as well as adversarial RL without switching costs (in [1, 3], etc.) is no longer achievable. This demonstrates the fundamental challenge of switching costs in adversarial RL, and it is expected that new challenges will arise when developing provably efficient algorithms.…”
Section: A Lower Bound
Mentioning confidence: 99%
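A one-line calculation makes the excerpt's conclusion concrete: since

$$\frac{\Omega(T^{2/3})}{\tilde{O}(\sqrt{T})} = \tilde{\Omega}(T^{1/6}) \to \infty \quad \text{as } T \to \infty,$$

any adversarial algorithm with switching costs whose regret matched the static-RL rate $\tilde{O}(\sqrt{T})$ would contradict the $\Omega(T^{2/3})$ lower bound for all large $T$.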