2020 · Preprint
DOI: 10.48550/arxiv.2003.06898

Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning

Fei Feng, Ruosong Wang, Wotao Yin, et al.

Abstract: We study how to use unsupervised learning for efficient exploration in reinforcement learning with rich observations generated from a small number of latent states. We present a novel algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret reinforcement learning algorithm. We show that our algorithm provably finds a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of possible observations.
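
To make the two-component structure described in the abstract concrete, here is a minimal sketch, not the authors' actual algorithm: scikit-learn's KMeans stands in for the unsupervised learning component, and plain epsilon-greedy tabular Q-learning stands in for the no-regret tabular RL component. The environment interface (env.reset, env.step) and all names below are hypothetical placeholders, not from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    class Decoder:
        """Unsupervised component: cluster rich observations into n_latent decoded states."""
        def __init__(self, n_latent):
            self.kmeans = KMeans(n_clusters=n_latent, n_init=10)

        def fit(self, observations):
            # observations: array of shape (num_samples, obs_dim)
            self.kmeans.fit(observations)

        def decode(self, observation):
            # Map one rich observation to its decoded latent state index.
            return int(self.kmeans.predict(observation.reshape(1, -1))[0])

    def tabular_rl_on_decoded_states(env, decoder, n_latent, n_actions,
                                     episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        """Tabular component: epsilon-greedy Q-learning over decoded latent states.
        The Q-table has n_latent rows, so its size is independent of the
        (possibly enormous) observation space."""
        Q = np.zeros((n_latent, n_actions))
        for _ in range(episodes):
            obs, done = env.reset(), False            # hypothetical env interface
            s = decoder.decode(obs)
            while not done:
                if np.random.rand() < eps:
                    a = np.random.randint(n_actions)  # explore
                else:
                    a = int(Q[s].argmax())            # exploit
                obs, r, done = env.step(a)            # hypothetical env interface
                s_next = decoder.decode(obs)
                target = r + gamma * (0.0 if done else Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q

The point the sketch makes concrete is that the tabular component operates entirely on decoded latent states: the Q-table has one row per latent state, so the RL sample complexity can scale with the number of latent states rather than with the number of possible observations, provided the decoder is accurate.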

Cited by 4 publications (4 citation statements) · References 32 publications

Citation statements, ordered by relevance:

“…Recently, many works have established that with additional assumptions, e.g. low-rankness of the transition, function approximation for Q-functions, etc., the sample complexity does not depend on |S| [Li et al., 2011, Wen and Van Roy, 2017, Krishnamurthy et al., 2016, Jiang et al., 2017, Dann et al., 2018, Du et al., 2019b, Feng et al., 2020, Du et al., 2019c, Zhong et al., 2019, Jin et al., 2019, Du et al., 2019a, Roy and Dong, 2019, Lattimore and Szepesvari, 2019, Zanette et al., 2020]. However, to our knowledge, the sample complexity of all these works scales polynomially with H, with the only exceptions requiring the transition to be deterministic [Wen and Van Roy, 2017, Du et al., 2020].…”
Section: Discussion and Further Open Problems
confidence: 99%
“…VALOR (Dann et al., 2018), PCID (Du et al., 2019a), HOMER (Misra et al., 2020), RegRL, and the approach from Feng et al. (2020) are algorithms for block MDPs, which is a more restricted setting than low-rank MDPs. These works require additional assumptions such as deterministic transitions (Dann et al., 2018), reachability (Misra et al., 2020; Du et al., 2019a), strong Bellman completion, and strong unsupervised learning oracles (Feng et al., 2020).…”
Section: Related Work
confidence: 99%
“…However, for real-world problems, the state space is often large, so we need to use function approximation. Developing provably efficient algorithms for RL problems with large state spaces has been a hot topic recently [Wen and Van Roy, 2013, Li et al., 2011, Du et al., 2019a, Krishnamurthy et al., 2016, Jiang et al., 2017, Dann et al., 2018, Du et al., 2019b, Sun et al., 2018, Du et al., 2019c, Feng et al., 2020a,b, Jin et al., 2019, Zanette et al., 2019a, Wang et al., 2020b, Cai et al., 2019, Ayoub et al., 2020]. These works are based on different assumptions.…”
Section: Related Work
confidence: 99%