2021
DOI: 10.48550/arxiv.2101.00494
Preprint

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost

Abstract: Many real-world applications, such as those in medical domains, recommendation systems, etc., can be formulated as large state space reinforcement learning problems with only a small budget for the number of policy changes, i.e., low switching cost. This paper focuses on the linear Markov Decision Process (MDP) recently studied in Yang and Wang [2019a] and Jin et al. [2019], where linear function approximation is used for generalization over the large state space. We present the first algorithm for linear MDP wit…
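For context, the linear MDP model named in the abstract (following Yang and Wang [2019a] and Jin et al. [2019]) assumes that transitions and rewards are linear in a known d-dimensional feature map. The restatement below is a standard formulation of that assumption, not text quoted from the paper.

```latex
% Linear MDP assumption (standard restatement, not quoted from the paper):
% a known feature map \phi and unknown measures \mu_h and vectors \theta_h satisfy,
% for every step h, state s, action a, and next state s',
\begin{align*}
  \mathbb{P}_h(s' \mid s, a) &= \langle \phi(s, a), \mu_h(s') \rangle, \\
  r_h(s, a) &= \langle \phi(s, a), \theta_h \rangle,
  \qquad \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d .
\end{align*}
```

Generalization over the large state space then comes from estimating these d-dimensional quantities rather than per-state values.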

Cited by 12 publications (29 citation statements)
References 23 publications
“…This shows the desired near-optimality guarantee for π whenever ε ≤ min{h⋆^{-2.5}, C_partial/S} and the number of episodes n satisfies (17). This proves Theorem 4.…”
supporting
confidence: 65%
“…Bridging online and offline RL Kalashnikov et al [26] observed empirically that the performance of policies trained purely from offline data can be improved considerably by a small amount of additional online fine-tuning. A recent line of work studied low switching cost RL [6, 62, 17, 53], which forbids online RL algorithms from switching their policies too often, as an interpolation between the online and offline settings. The same problem is also studied empirically as deployment-efficient RL [36, 46].…”
Section: Related Work
mentioning
confidence: 99%
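To make the constraint of "switching their policies too often" concrete, the following minimal Python sketch runs episodes with a frozen policy and replans only when the determinant of the regularized feature covariance matrix has doubled, which is the rare-switching trigger used by several of the cited low switching cost algorithms. The environment interface (env), feature map (phi), and planner (plan) are hypothetical placeholders, so this is an illustration of the switching rule rather than the algorithm from the paper.

```python
import numpy as np

def low_switching_loop(env, phi, plan, d, num_episodes, lam=1.0):
    """Illustrative rare-switching loop: replan only when the regularized
    feature covariance determinant has doubled since the last policy switch.
    `env`, `phi`, and `plan` are hypothetical stubs, not APIs from the paper."""
    cov = lam * np.eye(d)                        # Lambda = lam * I + sum of phi phi^T
    _, logdet_at_switch = np.linalg.slogdet(cov)
    policy = plan(cov)                           # initial policy from the planner stub
    num_switches = 0

    for _ in range(num_episodes):
        _, logdet = np.linalg.slogdet(cov)
        # Switch (replan) only if det(cov) has at least doubled since the last switch.
        if logdet > logdet_at_switch + np.log(2.0):
            policy = plan(cov)
            logdet_at_switch = logdet
            num_switches += 1

        # Roll out one episode with the current (frozen) policy.
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            x = phi(s, a)                        # d-dimensional feature of (s, a)
            cov += np.outer(x, x)                # accumulate covariance statistics
            s, _, done = env.step(a)

    return num_switches
```

Because the determinant of a d×d covariance built from bounded features can double only O(d log K) times over K episodes, the number of replanning calls stays logarithmic in K, which is the mechanism behind the switching cost bounds quoted in the statements that follow.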
“…When F is the class of d-dimensional linear functions, the global switching cost bound given in Theorem 1 is O(d²H), which is worse than the O(dH) bound given in Gao et al [2021]. However, for linear functions, our sampling procedure is equivalent to the online leverage score sampling [Cohen et al, 2016], and therefore, by using the analysis in [Cohen et al, 2016], which is specific to the linear setting, the switching cost bound can be improved to O(dH), matching the bound given in Gao et al [2021]. Using the same technique, our regret bound can be improved to O(√(d³H³T)) in the linear setting, matching the bounds given in Jin et al [2020b] and Gao et al [2021].…”
Section: Theoretical Guarantee and The Analysis
mentioning
confidence: 82%
“…However, for linear functions, our sampling procedure is equivalent to the online leverage score sampling [Cohen et al, 2016], and therefore, by using the analysis in [Cohen et al, 2016], which is specific to the linear setting, the switching cost bound can be improved to O(dH), matching the bound given in Gao et al [2021]. Using the same technique, our regret bound can be improved to O(√(d³H³T)) in the linear setting, matching the bounds given in Jin et al [2020b] and Gao et al [2021]. Now we present the major steps for proving Theorem 1 to highlight the technical novelties and difficulties in the analysis.…”
Section: Theoretical Guarantee and The Analysis
mentioning
confidence: 99%
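The O(dH)-style global switching cost discussed above follows from a standard determinant-counting (equivalently, leverage score) argument. A sketch of that counting step, assuming ‖φ(s,a)‖₂ ≤ 1, regularization λI, K episodes, and a switch triggered whenever det Λ_h doubles at some step h, is:

```latex
% Counting determinant doublings (standard argument, not quoted from the paper):
% \Lambda_h^{(k)} = \lambda I + \sum_{\tau < k} \phi_h^{\tau} (\phi_h^{\tau})^{\top},
% so trace(\Lambda_h^{(K)}) \le \lambda d + K and det(\Lambda_h^{(K)}) \le (\lambda + K/d)^d.
\[
  N_{\mathrm{switch}}
  \;\le\; \sum_{h=1}^{H} \log_2 \frac{\det \Lambda_h^{(K)}}{\det \Lambda_h^{(1)}}
  \;\le\; \sum_{h=1}^{H} d \log_2\!\Big(1 + \frac{K}{\lambda d}\Big)
  \;=\; O\big(d H \log K\big).
\]
```

Up to logarithmic factors this matches the O(dH) bound attributed to Gao et al [2021] above; the O(d²H) bound in the quoted Theorem 1 comes from the more general function-class analysis rather than from this linear-specific counting.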
“…Algorithms for regret minimization:

Algorithm                               Regret          Switching cost
UCB2-Bernstein [Bai et al, 2019]        O(√(H³SAT))     Local: O(H³SA log T)
UCB-Advantage [Zhang et al, 2020c]      O(√(H²SAT))     Local: O(H²SA log T)
Algorithm 1 in [Gao et al, 2021]        O(…”
mentioning
confidence: 99%
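For reference, the "Local" column in the table above and the global switching cost used elsewhere on this page are usually defined as follows (a standard restatement in the spirit of Bai et al [2019], not taken from the table's source):

```latex
% Global: number of episodes at which the policy changes at all.
% Local: number of (step, state) pairs whose prescribed action changes, summed over episodes.
\[
  N^{\mathrm{gl}}_{\mathrm{switch}} = \sum_{k=1}^{K-1} \mathbf{1}\{\pi^{k+1} \neq \pi^{k}\},
  \qquad
  N^{\mathrm{loc}}_{\mathrm{switch}} = \sum_{k=1}^{K-1}
    \big|\{(h, s) : \pi^{k+1}_h(s) \neq \pi^{k}_h(s)\}\big| .
\]
```

A local bound always implies a global bound of at most the same order, since a global switch occurs only when at least one (step, state) action changes, which is why the local O(H³SA log T) and O(H²SA log T) entries are directly comparable to the global bounds discussed for the linear setting.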