Online Sub-Sampling for Reinforcement Learning with General Function Approximation

Kong, Dingwen; Salakhutdinov, Ruslan; Wang, Ruosong; Yang, Lin F.

doi:10.48550/arxiv.2106.07203

Cited by 8 publications

(43 citation statements)

References 24 publications


“…Proposed by Russo and Van Roy (2013), eluder dimension has become a widely-used concept to characterize the complexity of different function classes in bandits and RL (Ayoub et al., 2020; Jin et al., 2021; Kong et al., 2021). In this work, we define eluder dimension to characterize the complexity of the function F :…”

Section: A2 Eluder Dimension (mentioning)

confidence: 99%

“…This corresponds to the last term (KD) in Inq 35. Therefore, to design efficient algorithm with near-optimal regret in the infinite-horizon setting, the algorithm should maintain low-switching property (Bai et al., 2019; Kong et al., 2021). Taking inspiration from the recent work that studies efficient exploration with low switching cost in episodic setting (Kong et al., 2021), we define the importance score, $\sup_{f_1,f_2\in\mathcal{F}} \frac{\|f_1-f_2\|^2_{Z_{\mathrm{new}}}}{\|f_1-f_2\|^2_{Z}+\alpha}$, as a measure of the importance for new samples collected in current episode, and only update the optimistic model and the policy when the importance score is greater than 1.…”

Section: C3 Infinite Simulator Class (mentioning)

confidence: 99%
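
The update rule described in the statement above can be sketched in a few lines. This is only an illustration, not the citing paper's implementation: it assumes a finite function class (so the supremum becomes a maximum over pairs), scalar inputs, and the data-dependent norm ‖f‖²_Z = Σ_{z∈Z} f(z)², none of which is spelled out in the quoted text. The names `sq_norm`, `importance_score`, and `should_update` are hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence

Func = Callable[[float], float]


def sq_norm(f1: Func, f2: Func, data: Sequence[float]) -> float:
    """Assumed norm: ||f1 - f2||_Z^2 = sum over z in Z of (f1(z) - f2(z))^2."""
    return sum((f1(z) - f2(z)) ** 2 for z in data)


def importance_score(funcs: Sequence[Func], z_old: Sequence[float],
                     z_new: Sequence[float], alpha: float) -> float:
    """Max over pairs of ||f1-f2||^2_{Z_new} / (||f1-f2||^2_Z + alpha).

    The ratio is symmetric in (f1, f2) and is 0 when f1 == f2, so
    unordered pairs of distinct functions are enough. Requires alpha > 0.
    """
    return max(
        (sq_norm(f1, f2, z_new) / (sq_norm(f1, f2, z_old) + alpha)
         for f1, f2 in combinations(funcs, 2)),
        default=0.0,  # fewer than two functions: nothing to distinguish
    )


def should_update(funcs: Sequence[Func], z_old: Sequence[float],
                  z_new: Sequence[float], alpha: float = 1.0) -> bool:
    """Trigger a model/policy update only when the score exceeds 1."""
    return importance_score(funcs, z_old, z_new, alpha) > 1.0


# Toy function class (hypothetical): f(z) = 0, f(z) = z, f(z) = 2z.
toy_class = [lambda z: 0.0, lambda z: z, lambda z: 2.0 * z]
print(should_update(toy_class, z_old=[1.0], z_new=[1.0]))  # new data adds little
print(should_update(toy_class, z_old=[1.0], z_new=[3.0]))  # new data is informative
```

Under these assumptions, new samples that barely separate any pair of functions beyond what the existing data already does leave the score below 1, so the policy is not switched; this is the low-switching behaviour the statement refers to.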


Understanding Domain Randomization for Sim-to-real Transfer

Chen, Hu, Jin, et al. 2021

Preprint


Reinforcement learning encounters many challenges when applied directly in the real world. Sim-to-real transfer is widely used to transfer the knowledge learned from simulation to the real world. Domain randomization, one of the most popular algorithms for sim-to-real transfer, has been demonstrated to be effective in various tasks in robotics and autonomous driving. Despite its empirical successes, theoretical understanding of why this simple algorithm works is limited. In this paper, we propose a theoretical framework for sim-to-real transfers, in which the simulator is modeled as a set of MDPs with tunable parameters (corresponding to unknown physical parameters such as friction). We provide sharp bounds on the sim-to-real gap, that is, the difference between the value of the policy returned by domain randomization and the value of an optimal policy for the real world. We prove that sim-to-real transfer can succeed under mild conditions without any real-world training samples. Our theory also highlights the importance of using memory (i.e., history-dependent policies) in domain randomization. Our proof is based on novel techniques that reduce the problem of bounding the sim-to-real gap to the problem of designing efficient learning algorithms for infinite-horizon MDPs, which we believe are of independent interest.


“…Beyond linear function approximation, in the finite-horizon setting researchers also start considering theoretical guarantees for general function approximation (Wang et al., 2020; Ishfaq et al., 2021; Kong et al., 2021). The study for SSP, which again is a strict generalization of the finite-horizon problems and might be a better model for many applications, falls behind in this regard, motivating us to explore in this direction with the goal of providing a more complete picture at least for linear function approximation.…”

Section: Related Work (mentioning)

confidence: 99%

Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

Chen, Jain, Luo 2021

Preprint


We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound $\tilde{O}(\sqrt{d^3 B_\star^2 T_\star K})$, where $d$ is the dimension of the feature space, $B_\star$ and $T_\star$ are upper bounds of the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic regret of order O, where $\mathrm{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\tilde{O}(d^{3.5} B_\star \sqrt{K})$ with no polynomial dependency on $T_\star$ or $1/c_{\min}$, almost matching the $\Omega(d B_\star \sqrt{K})$ lower bound from (Min et al., 2021).


“…Nonlinear generalizations: Some nonlinear generalizations of LMDPs have been proposed, such as the case where the state-action value function belongs to a class of bounded eluder dimension [Russo and Van Roy, 2013] or can be represented by a kernel function or neural network. While such generalization is important, to our knowledge, these works (see, e.g., Chowdhury and Oliveira [2020], Ishfaq et al. [2021], Kong et al. [2021], Wang et al. [2019], Yang et al. [2020b,a]) fail to improve over Jin et al. [2020], Zanette et al. [2020b] in terms of (P1), (P2), or (P3) (or regret).…”

Section: Related Work (mentioning)

confidence: 99%

Improved Algorithms for Misspecified Linear Markov Decision Processes

Vial, Parulekar, Shakkottai, et al. 2021

Preprint


For the misspecified linear Markov decision process (MLMDP) model of Jin et al. [2020], we propose an algorithm with three desirable properties. (P1) Its regret after $K$ episodes scales as $K \max\{\varepsilon_{\mathrm{mis}}, \varepsilon_{\mathrm{tol}}\}$, where $\varepsilon_{\mathrm{mis}}$ is the degree of misspecification and $\varepsilon_{\mathrm{tol}}$ is a user-specified error tolerance. (P2) Its space and per-episode time complexities remain bounded as $K \to \infty$. (P3) It does not require $\varepsilon_{\mathrm{mis}}$ as input. To our knowledge, this is the first algorithm satisfying all three properties. For concrete choices of $\varepsilon_{\mathrm{tol}}$, we also improve existing regret bounds (up to log factors) while achieving either (P2) or (P3) (existing algorithms satisfy neither). At a high level, our algorithm generalizes (to MLMDPs) and refines the Sup-Lin-UCB algorithm, which Takemura et al. [2021] recently showed satisfies (P3) in the contextual bandit setting.
