2014
DOI: 10.1016/j.tcs.2014.09.029
Near-optimal PAC bounds for discounted MDPs

Abstract: We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being…

Cited by 34 publications
(40 citation statements)
References 6 publications
“…Our lower bound instance is simplified from the instances in Azar et al [2013], Lattimore and Hutter [2014], Pananjady and Wainwright [2020].…”
Section: C1 Lower Bound of Offline Evaluation
confidence: 99%
“…Theorem IV.6. (Theorem 1, [110] and Theorem 11, [111]) In both the generative and online sampling models, for ε and δ small enough, there exists an MDP for which learning requires a sample complexity in Ω( n_x n_u / (ε² (1−γ)³) · log(n_x/δ) ) (where γ denotes the discount factor).…”
Section: B Discounted MDPs
confidence: 99%
“…2) The price of model-free approaches: Some model-based algorithms are known to match the minimax sample complexity lower bound. In the online sampling setting, the authors of [111] present UCRL(γ), an extension of UCRL to discounted costs, and establish a minimax sample complexity upper bound matching the above lower bound. UCRL(γ) derives upper confidence bounds for the MDP parameters and selects actions optimistically (which can lead to significant computational issues).…”
Section: B Discounted MDPs
confidence: 99%
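The optimistic mechanism described in the quote (confidence bounds on the estimated MDP parameters, then greedy action selection against the optimistic values) can be sketched in a few lines. This is a minimal illustration, not the UCRL(γ) algorithm from [111]: the function name `optimistic_q_values`, the Hoeffding-style bonus, and the truncated value-iteration backup are assumptions made for the example.

```python
import numpy as np

def optimistic_q_values(counts, rewards, gamma, delta, n_iter=500):
    """Optimistic planning sketch: empirical model plus a confidence
    bonus, followed by value iteration on the optimistic Q-values.

    counts[s, a, s2] : observed transition counts
    rewards[s, a]    : rewards in [0, 1], assumed known here
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)                         # visits to each (s, a)
    p_hat = counts / np.maximum(n_sa, 1)[..., None]   # empirical transition model
    vmax = 1.0 / (1.0 - gamma)                        # value range, scales the bonus
    # Hoeffding-style bonus; unvisited pairs get the maximum possible bonus.
    bonus = vmax * np.sqrt(np.log(2 * S * A / delta) / (2 * np.maximum(n_sa, 1)))
    bonus[n_sa == 0] = vmax
    q = np.zeros((S, A))
    for _ in range(n_iter):
        v = q.max(axis=1)                             # greedy (optimistic) values
        q = np.minimum(rewards + bonus + gamma * p_hat @ v, vmax)
    return q
```

Unvisited state-action pairs keep the maximum bonus 1/(1−γ), which is what drives exploration toward poorly-understood parts of the model and is also the source of the computational burden the quote alludes to: the confidence region must be re-planned over as data accumulates.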
“…In particular, Jaksch et al. (2010) provided a regret lower bound Ω(√(HSAT)) for H-horizon MDPs. There is also a line of work studying the sample complexity of obtaining a value or policy that is at most ε-suboptimal (Kakade, 2003; Strehl et al., 2006, 2009; Szita and Szepesvári, 2010; Lattimore and Hutter, 2014; Azar et al., 2013; Dann and Brunskill, 2015; Sidford et al., 2018). The optimal sample complexity for finding an ε-optimal policy is O( |S||A| (1−γ)⁻² ε⁻² ) (Sidford et al., 2018) for a discounted MDP with discount factor γ.…”
Section: Related Literature
confidence: 99%