2021
DOI: 10.48550/arxiv.2101.00494
Preprint

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost

Abstract: Many real-world applications, such as those in medical domains, recommendation systems, etc., can be formulated as large state space reinforcement learning problems with only a small budget for the number of policy changes, i.e., low switching cost. This paper focuses on the linear Markov Decision Process (MDP) recently studied in Yang and Wang [2019a] and Jin et al. [2019], where linear function approximation is used for generalization over the large state space. We present the first algorithm for linear MDP wit…
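For context, the linear MDP model named in the abstract (following Yang and Wang [2019a] and Jin et al. [2019]) assumes that transitions and rewards are linear in a known d-dimensional feature map. The restatement below is a standard formulation of that assumption, not text quoted from the paper.

```latex
% Linear MDP assumption (standard restatement, not quoted from the paper):
% a known feature map \phi and unknown measures \mu_h and vectors \theta_h satisfy,
% for every step h, state s, action a, and next state s',
\begin{align*}
  \mathbb{P}_h(s' \mid s, a) &= \langle \phi(s, a), \mu_h(s') \rangle, \\
  r_h(s, a) &= \langle \phi(s, a), \theta_h \rangle,
  \qquad \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d .
\end{align*}
```

Generalization over the large state space then comes from estimating these d-dimensional quantities rather than per-state values.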

Cited by 12 publications (29 citation statements)
References 23 publications
“…This shows the desired near-optimality guarantee for π whenever ε ≤ min{h⋆^{-2.5}, C_partial/S} and the number of episodes n satisfies (17). This proves Theorem 4.…”
supporting
confidence: 65%
“…Bridging online and offline RL Kalashnikov et al [26] observed empirically that the performance of policies trained purely from offline data can be improved considerably by a small amount of additional online fine-tuning. A recent line of work studied low switching cost RL [6, 62, 17, 53], which forbids online RL algorithms from switching their policies too often, as an interpolation between the online and offline settings. The same problem is also studied empirically as deployment-efficient RL [36, 46].…”
Section: Related Work
mentioning
confidence: 99%
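To make the constraint of "switching their policies too often" concrete, the following minimal Python sketch runs episodes with a frozen policy and replans only when the determinant of the regularized feature covariance matrix has doubled, which is the rare-switching trigger used by several of the cited low switching cost algorithms. The environment interface (env), feature map (phi), and planner (plan) are hypothetical placeholders, so this is an illustration of the switching rule rather than the algorithm from the paper.

```python
import numpy as np

def low_switching_loop(env, phi, plan, d, num_episodes, lam=1.0):
    """Illustrative rare-switching loop: replan only when the regularized
    feature covariance determinant has doubled since the last policy switch.
    `env`, `phi`, and `plan` are hypothetical stubs, not APIs from the paper."""
    cov = lam * np.eye(d)                        # Lambda = lam * I + sum of phi phi^T
    _, logdet_at_switch = np.linalg.slogdet(cov)
    policy = plan(cov)                           # initial policy from the planner stub
    num_switches = 0

    for _ in range(num_episodes):
        _, logdet = np.linalg.slogdet(cov)
        # Switch (replan) only if det(cov) has at least doubled since the last switch.
        if logdet > logdet_at_switch + np.log(2.0):
            policy = plan(cov)
            logdet_at_switch = logdet
            num_switches += 1

        # Roll out one episode with the current (frozen) policy.
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            x = phi(s, a)                        # d-dimensional feature of (s, a)
            cov += np.outer(x, x)                # accumulate covariance statistics
            s, _, done = env.step(a)

    return num_switches
```

Because the determinant of a d×d covariance built from bounded features can double only O(d log K) times over K episodes, the number of replanning calls stays logarithmic in K, which is the mechanism behind the switching cost bounds quoted in the statements that follow.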
“…When F is the class of d-dimensional linear functions, the global switching cost bound given in Theorem 1 is O(d²H), which is worse than the O(dH) bound given in Gao et al [2021]. However, for linear functions, our sampling procedure is equivalent to the online leverage score sampling [Cohen et al, 2016], and therefore, by using the analysis in [Cohen et al, 2016], which is specific to the linear setting, the switching cost bound can be improved to O(dH), matching the bound given in Gao et al [2021]. Using the same technique, our regret bound can be improved to O(√(d³H³T)) in the linear setting, matching the bounds given in Jin et al [2020b] and Gao et al [2021].…”
Section: Theoretical Guarantee and The Analysis
mentioning
confidence: 82%
“…However, for linear functions, our sampling procedure is equivalent to the online leverage score sampling [Cohen et al, 2016], and therefore, by using the analysis in [Cohen et al, 2016], which is specific to the linear setting, the switching cost bound can be improved to O(dH), matching the bound given in Gao et al [2021]. Using the same technique, our regret bound can be improved to O(√(d³H³T)) in the linear setting, matching the bounds given in Jin et al [2020b] and Gao et al [2021]. Now we present the major steps for proving Theorem 1 to highlight the technical novelties and difficulties in the analysis.…”
Section: Theoretical Guarantee and The Analysis
mentioning
confidence: 99%
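The O(dH)-style global switching cost discussed above follows from a standard determinant-counting (equivalently, leverage score) argument. A sketch of that counting step, assuming ‖φ(s,a)‖₂ ≤ 1, regularization λI, K episodes, and a switch triggered whenever det Λ_h doubles at some step h, is:

```latex
% Counting determinant doublings (standard argument, not quoted from the paper):
% \Lambda_h^{(k)} = \lambda I + \sum_{\tau < k} \phi_h^{\tau} (\phi_h^{\tau})^{\top},
% so trace(\Lambda_h^{(K)}) \le \lambda d + K and det(\Lambda_h^{(K)}) \le (\lambda + K/d)^d.
\[
  N_{\mathrm{switch}}
  \;\le\; \sum_{h=1}^{H} \log_2 \frac{\det \Lambda_h^{(K)}}{\det \Lambda_h^{(1)}}
  \;\le\; \sum_{h=1}^{H} d \log_2\!\Big(1 + \frac{K}{\lambda d}\Big)
  \;=\; O\big(d H \log K\big).
\]
```

Up to logarithmic factors this matches the O(dH) bound attributed to Gao et al [2021] above; the O(d²H) bound in the quoted Theorem 1 comes from the more general function-class analysis rather than from this linear-specific counting.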
“…Algorithms for regret minimization:

Algorithm                               Regret          Switching cost
UCB2-Bernstein [Bai et al, 2019]        O(√(H³SAT))     Local: O(H³SA log T)
UCB-Advantage [Zhang et al, 2020c]      O(√(H²SAT))     Local: O(H²SA log T)
Algorithm 1 in [Gao et al, 2021]        O(…”
mentioning
confidence: 99%
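For reference, the "Local" column in the table above and the global switching cost used elsewhere on this page are usually defined as follows (a standard restatement in the spirit of Bai et al [2019], not taken from the table's source):

```latex
% Global: number of episodes at which the policy changes at all.
% Local: number of (step, state) pairs whose prescribed action changes, summed over episodes.
\[
  N^{\mathrm{gl}}_{\mathrm{switch}} = \sum_{k=1}^{K-1} \mathbf{1}\{\pi^{k+1} \neq \pi^{k}\},
  \qquad
  N^{\mathrm{loc}}_{\mathrm{switch}} = \sum_{k=1}^{K-1}
    \big|\{(h, s) : \pi^{k+1}_h(s) \neq \pi^{k}_h(s)\}\big| .
\]
```

A local bound always implies a global bound of at most the same order, since a global switch occurs only when at least one (step, state) action changes, which is why the local O(H³SA log T) and O(H²SA log T) entries are directly comparable to the global bounds discussed for the linear setting.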